Adding basic NUMA awareness
Hi,
This is a WIP version of a patch series I'm working on, adding some
basic NUMA awareness for a couple parts of our shared memory (shared
buffers, etc.). It's based on Andres' experimental patches he spoke
about at pgconf.eu 2024 [1], and while it's improved and polished in
various ways, it's still experimental.
But there's a recent thread aiming to do something similar [2], so
better to share it now so that we can discuss both approaches. This
patch set is a bit more ambitious, handling NUMA in a way to allow
smarter optimizations later, so I'm posting it in a separate thread.
The series is split into patches addressing different parts of the
shared memory, starting (unsurprisingly) from shared buffers, then
buffer freelists and ProcArray. There are a couple of additional parts,
but those are smaller, addressing miscellaneous stuff.
Each patch has a numa_ GUC, intended to enable/disable that part. This
is meant to make development easier, not as a final interface. I'm not
sure how exactly that should look. It's possible some combinations of
GUCs won't work, etc.
Each patch should have a commit message explaining the intent and
implementation, and then also detailed comments explaining various
challenges and open questions.
But let me go over the basics, and discuss some of the design choices
and open questions that need solving.
1) v1-0001-NUMA-interleaving-buffers.patch
This is the main thing when people think about NUMA - making sure the
shared buffers are allocated evenly on all the nodes, not just on a
single node (which can happen easily with warmup). The regular memory
interleaving would address this, but it also has some disadvantages.
Firstly, it's oblivious to the contents of the shared memory segment,
and we may not want to interleave everything. It's also oblivious to
alignment of the items (a buffer can easily end up "split" on multiple
NUMA nodes), or relationship between different parts (e.g. there's a
BufferBlock and a related BufferDescriptor, and those might again end up
on different nodes).
So the patch handles this by explicitly mapping chunks of shared buffers
to different nodes - a bit like interleaving, but in larger chunks.
Ideally each node gets (1/N) of shared buffers, as a contiguous chunk.
It's a bit more complicated, because the patch distributes both the
blocks and descriptors, in the same way. So a buffer and its descriptor
always end up on the same NUMA node. This is one of the reasons why we need
to map larger chunks, because NUMA works on page granularity, and the
descriptors are tiny - many fit on a memory page.
There's a secondary benefit of explicitly assigning buffers to nodes,
using this simple scheme - it allows quickly determining the node ID
given a buffer ID. This is helpful later, when building the freelists.
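For illustration, the mapping is just integer arithmetic over the chunk
size - this mirrors BufferGetNode() in the 0001 patch (the helper name
here is made up):

    /* chunked round-robin: which NUMA node does this buffer belong to? */
    static inline int
    buffer_id_to_node(int buf_id, int64 chunk_buffers, int num_nodes)
    {
        return (int) ((buf_id / chunk_buffers) % num_nodes);
    }

With 8kB buffers a 1GB chunk is 131072 buffers, so on a 4-node system
buffers 0-131071 map to node 0, 131072-262143 to node 1, and so on,
wrapping around after the last node.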
The patch is fairly simple. Most of the complexity is about picking the
chunk size, and aligning the arrays (so that it nicely aligns with
memory pages).
The patch has a GUC "numa_buffers_interleave", with "off" by default.
2) v1-0002-NUMA-localalloc.patch
This simply sets "localalloc" when initializing a backend, so that all
memory allocated later is local, not interleaved. Initially this was
necessary because the patch set the allocation policy to interleaving
before initializing shared memory, and we didn't want to interleave the
private memory. But that's no longer the case - the explicit mapping to
nodes does not have this issue. I'm keeping the patch for convenience,
it allows experimenting with numactl etc.
The patch has a GUC "numa_localalloc", with "off" by default.
3) v1-0003-freelist-Don-t-track-tail-of-a-freelist.patch
Minor optimization. Andres noticed we're tracking the tail of buffer
freelist, without using it. So the patch removes that.
4) v1-0004-NUMA-partition-buffer-freelist.patch
Right now we have a single freelist, and in busy instances that can be
quite contended. What's worse, the freelist may thrash between different
CPUs, NUMA nodes, etc. So the idea is to have multiple freelists on
subsets of buffers. The patch implements multiple strategies for how the
list can be split (configured using the "numa_partition_freelist" GUC), for
experimenting (see the sketch after the list):
* node - One list per NUMA node. This is the most natural option,
because we now know which buffer is on which node, so we can ensure a
list for a node only has buffers from that node.
* cpu - One list per CPU. Pretty simple, each CPU gets its own list.
* pid - Similar to "cpu", but the processes are mapped to lists based on
PID, not CPU ID.
* none - nothing, a single freelist
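The selection itself is trivial - simplified from ChooseFreeList() in the
0004 patch:

    unsigned    cpu, node;
    int         freelist_idx = 0;

    if (getcpu(&cpu, &node) != 0)
        elog(ERROR, "getcpu failed: %m");

    switch (numa_partition_freelist)
    {
        case FREELIST_PARTITION_CPU:
            freelist_idx = cpu % strategy_ncpus;
            break;
        case FREELIST_PARTITION_NODE:
            freelist_idx = node % strategy_nnodes;
            break;
        case FREELIST_PARTITION_PID:
            freelist_idx = MyProcPid % strategy_ncpus;
            break;
        default:                /* "none" - single freelist */
            break;
    }

    return &StrategyControl->freelists[freelist_idx];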
Ultimately, I think we'll want to go with "node", simply because it
aligns with the buffer interleaving. But there are improvements needed.
The main challenge is that with multiple smaller lists, a process can't
really use the whole shared buffers. So a single backend will only use
part of the memory. The more lists there are, the worse this effect is.
This is also why I think we won't use the other partitioning options,
because there's going to be more CPUs than NUMA nodes.
Obviously, this needs solving even with per-node freelists - we need to
allow a single backend to utilize the whole shared buffers if needed. There
should be a way to "steal" buffers from other freelists (if the
"regular" freelist is empty), but the patch does not implement this.
Shouldn't be hard, I think.
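A rough sketch of what the fallback might look like (hypothetical, not in
the attached patches - get_buffer_from_freelist() is a made-up helper
wrapping the spinlocked removal from one list):

    /* try the "home" freelist first */
    freelist = ChooseFreeList();
    buf = get_buffer_from_freelist(freelist);       /* made-up helper */

    /* empty? steal from the other partitions before the clock sweep */
    for (int i = 0; buf == NULL && i < strategy_ncpus; i++)
        buf = get_buffer_from_freelist(&StrategyControl->freelists[i]);

    if (buf == NULL)
    {
        /* all freelists are empty, fall back to the clock sweep */
    }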
The other missing part is clocksweep - there's still just a single
instance of clocksweep, feeding buffers to all the freelists. But that's
clearly a problem, because the clocksweep returns buffers from all NUMA
nodes. The clocksweep really needs to be partitioned the same way as the
freelists, and each partition will operate on a subset of buffers (from
the right NUMA node).
I do have a separate experimental patch doing something like that; I
need to make it part of this branch.
5) v1-0005-NUMA-interleave-PGPROC-entries.patch
Another area that seems like it might benefit from NUMA is PGPROC, so I
gave it a try. It turned out somewhat challenging. Similarly to buffers
we have two pieces that need to be located in a coordinated way - PGPROC
entries and fast-path arrays. But we can't use the same approach as for
buffers/descriptors, because
(a) Neither of those pieces aligns with memory page size (PGPROC is
~900B, fast-path arrays are variable length).
(b) We could pad PGPROC entries e.g. to 1KB, but that'd still require
rather high max_connections before we use multiple huge pages.
The fast-path arrays are less of a problem, because those tend to be
larger, and are accessed through pointers, so we can just adjust that.
So what I did instead is split the whole PGPROC array into one array
per NUMA node, and one array for auxiliary processes and 2PC xacts. So
with 4 NUMA nodes there are 5 separate arrays, for example. Each array
is a multiple of memory pages, so we may waste some of the memory. But
that's simply how NUMA works - page granularity.
This however makes one particular thing harder - in a couple places we
accessed PGPROC entries through PROC_HDR->allProcs, which was pretty
much just one large array. And GetNumberFromPGProc() relied on array
arithmetics to determine procnumber. With the array partitioned, this
can't work the same way.
But there's a simple solution - if we turn allProcs into an array of
*pointers* to PGPROC arrays, there's no issue. All the places need a
pointer anyway. And then we need an explicit procnumber field in PGPROC,
instead of calculating it.
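Roughly this (a sketch of the change described above, not a verbatim
excerpt from the 0005 patch):

    /* allProcs becomes an array of pointers into the per-node arrays */
    PGPROC    **allProcs;

    /* lookup by procnumber stays a single dereference */
    #define GetPGProcByNumber(n)    (ProcGlobal->allProcs[(n)])

    /* the reverse direction reads the explicit field in PGPROC, */
    /* instead of doing pointer arithmetic on one big array      */
    #define GetNumberFromPGProc(proc)   ((proc)->procnumber)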
There's a chance this has a negative impact on code that accesses PGPROC
very often, but so far I haven't seen such cases. But if you can come up
with such examples, I'd like to see those.
There's another detail - when obtaining a PGPROC entry in InitProcess(),
we try to get an entry from the same NUMA node. And only if that doesn't
work, we grab the first one from the list (there's still just one PGPROC
freelist, I haven't split that - maybe we should?).
This has a GUC "numa_procs_interleave", again "off" by default. It's not
quite correct, though, because the partitioning happens always. It only
affects the PGPROC lookup. (In a way, this may be a bit broken.)
6) v1-0006-NUMA-pin-backends-to-NUMA-nodes.patch
This is an experimental patch, that simply pins the new process to the
NUMA node obtained from the freelist.
Driven by GUC "numa_procs_pin" (default: off).
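With libnuma the pinning itself is a one-liner, roughly like this (sketch,
not necessarily exactly what the patch does - the numa_node field is made
up, it'd be whatever node the PGPROC entry was taken from):

    #ifdef USE_LIBNUMA
        /* restrict the new backend to CPUs of its NUMA node */
        if (numa_procs_pin && MyProc->numa_node >= 0)
        {
            if (numa_run_on_node(MyProc->numa_node) != 0)
                elog(WARNING, "numa_run_on_node failed: %m");
        }
    #endif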
Summary
-------
So this is what I have at the moment. I've tried to organize the patches
in the order of importance, but that's just my guess. It's entirely
possible there's something I missed, some other order might make more
sense, etc.
There's also the question how this is related to other patches affecting
shared memory - I think the most relevant one is the "shared buffers
online resize" by Ashutosh, simply because it touches the shared memory.
I think the splitting would actually make some things simpler, or maybe
more flexible - in particular, it'd allow us to enable huge pages
only for some regions (like shared buffers), and keep the small pages
e.g. for PGPROC. So that'd be good.
But there'd also need to be some logic to "rework" how shared buffers
get mapped to NUMA nodes after resizing. It'd be silly to start with
memory on 4 nodes (25% each), resize shared buffers to 50% and end up
with memory only on 2 of the nodes (because the other 2 nodes were
originally assigned the upper half of shared buffers).
I don't have a clear idea how this would be done, but I guess it'd
require a bit of code invoked sometime after the resize. It'd already
need to rebuild the freelists in some way, I guess.
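One possible shape, reusing the helpers from 0001 (just a sketch - both
functions are static in buf_init.c, so this is purely illustrative):

    /* after resizing NBuffers, redo the chunk-to-node mapping ... */
    numa_chunk_buffers = choose_chunk_buffers(NBuffers, mem_page_size,
                                              numa_nodes);

    pg_numa_interleave_memory(BufferBlocks,
                              BufferBlocks + (Size) NBuffers * BLCKSZ,
                              mem_page_size,
                              numa_chunk_buffers * BLCKSZ,
                              numa_nodes);

    /* ... the same for BufferDescriptors, then rebuild the freelists */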
The other thing I haven't thought about very much is determining on
which CPUs/nodes the instance is allowed to run on. I assume we'd start
by simply inheriting/determining that at the start through libnuma, not
through some custom PG configuration (which the patch in [2] proposed to do).
regards
[1]: https://www.youtube.com/watch?v=V75KpACdl6E
[2]: /messages/by-id/CAKZiRmw6i1W1AwXxa-Asrn8wrVcVH3TO715g_MCoowTS9rkGyw@mail.gmail.com
--
Tomas Vondra
Attachments:
v1-0001-NUMA-interleaving-buffers.patch (text/x-patch)
From 9712e50d6d15c18ea2c5fcf457972486b0d4ef53 Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tomas@vondra.me>
Date: Tue, 6 May 2025 21:12:21 +0200
Subject: [PATCH v1 1/6] NUMA: interleaving buffers
Ensure shared buffers are allocated from all NUMA nodes, in a balanced
way, instead of just using the node where Postgres initially starts, or
where the kernel decides to migrate the page, etc. With pre-warming
performed by a single backend, this can easily result in severely
unbalanced memory distribution (with most of it on a single NUMA node).
The kernel would eventually move some of the memory to other nodes
(thanks to zone_reclaim), but that tends to take a long time. So this
patch improves predictability, reduces the time needed for warmup
during benchmarking, etc. It's less dependent on what the CPU
scheduler does, etc.
Furthermore, the buffers are mapped to NUMA nodes in a deterministic
way, so this also allows further improvements like backends using
buffers from the same NUMA node.
The effect is similar to
numactl --interleave=all
but there's a number of important differences.
Firstly, it's applied only to shared buffers (and also to descriptors),
not to the whole shared memory segment. It's not clear we'd want to use
interleaving for all parts, storing entries with different sizes and
life cycles (e.g. ProcArray may need a different approach).
Secondly, it considers the page and block size, and makes sure not to
split a buffer on different NUMA nodes (which with the regular
interleaving is guaranteed to happen, unless using huge pages). The
patch performs "explicit" interleaving, so that buffers are not split
like this.
The patch maps both buffers and buffer descriptors, so that the buffer
and its buffer descriptor end up on the same NUMA node.
The mapping happens in larger chunks (see choose_chunk_items). This is
required to handle buffer descriptors (which are smaller than buffers),
and it should also help to reduce the number of mappings. Most NUMA
systems will use 1GB chunks, unless using very small shared buffers.
Notes:
* The feature is enabled by numa_buffers_interleave GUC (false by default)
* It's not clear we want to enable interleaving for all shared memory.
We probably want that for shared buffers, but maybe not for ProcArray
or freelists.
* Similar questions are about huge pages - in general it's a good idea,
but maybe it's not quite good for ProcArray. It's somewhat separate
from NUMA, but not entirely because NUMA works on page granularity.
PGPROC entries are ~8KB, so too large for interleaving with 4K pages,
as we don't want to split the entry to multiple nodes. But could be
done explicitly, by specifying which node to use for the pages.
* We could partition ProcArray, with one partition per NUMA node, and
then at connection time pick a node from the same node. The process
could migrate to some other node later, especially for long-lived
connections, but there's no perfect solution. Maybe we could set
affinity to cores from the same node, or something like that?
---
src/backend/storage/buffer/buf_init.c | 384 +++++++++++++++++++++++++-
src/backend/storage/buffer/bufmgr.c | 1 +
src/backend/utils/init/globals.c | 3 +
src/backend/utils/misc/guc_tables.c | 10 +
src/bin/pgbench/pgbench.c | 67 ++---
src/include/miscadmin.h | 2 +
src/include/storage/bufmgr.h | 1 +
7 files changed, 427 insertions(+), 41 deletions(-)
diff --git a/src/backend/storage/buffer/buf_init.c b/src/backend/storage/buffer/buf_init.c
index ed1dc488a42..2ad34624c49 100644
--- a/src/backend/storage/buffer/buf_init.c
+++ b/src/backend/storage/buffer/buf_init.c
@@ -14,9 +14,17 @@
*/
#include "postgres.h"
+#ifdef USE_LIBNUMA
+#include <numa.h>
+#include <numaif.h>
+#endif
+
+#include "port/pg_numa.h"
#include "storage/aio.h"
#include "storage/buf_internals.h"
#include "storage/bufmgr.h"
+#include "storage/pg_shmem.h"
+#include "storage/proc.h"
BufferDescPadded *BufferDescriptors;
char *BufferBlocks;
@@ -25,6 +33,19 @@ WritebackContext BackendWritebackContext;
CkptSortItem *CkptBufferIds;
+static Size get_memory_page_size(void);
+static int64 choose_chunk_buffers(int NBuffers, Size mem_page_size, int num_nodes);
+static void pg_numa_interleave_memory(char *startptr, char *endptr,
+ Size mem_page_size, Size chunk_size,
+ int num_nodes);
+
+/* number of buffers allocated on the same NUMA node */
+static int64 numa_chunk_buffers = -1;
+
+/* number of NUMA nodes (as returned by numa_num_configured_nodes) */
+static int numa_nodes = -1;
+
+
/*
* Data Structures:
* buffers live in a freelist and a lookup data structure.
@@ -71,18 +92,80 @@ BufferManagerShmemInit(void)
foundDescs,
foundIOCV,
foundBufCkpt;
+ Size mem_page_size;
+ Size buffer_align;
+
+ /*
+ * XXX A bit weird. Do we need to worry about postmaster? Could this even
+ * run outside postmaster? I don't think so.
+ *
+ * XXX Another issue is we may get different values than when sizing the
+ * memory, because at that point we didn't know if we get huge pages,
+ * so we assumed we will. Shouldn't cause crashes, but we might allocate
+ * shared memory and then not use some of it (because of the alignment
+ * that we don't actually need). Not sure about better way, good for now.
+ */
+ if (IsUnderPostmaster)
+ mem_page_size = pg_get_shmem_pagesize();
+ else
+ mem_page_size = get_memory_page_size();
+
+ /*
+ * With NUMA we need to ensure the buffers are properly aligned not just
+ * to PG_IO_ALIGN_SIZE, but also to memory page size, because NUMA works
+ * on page granularity, and we don't want a buffer to get split to
+ * multiple nodes (when using multiple memory pages).
+ *
+ * We also don't want to interfere with other parts of shared memory,
+ * which could easily happen with huge pages (e.g. with data stored before
+ * buffers).
+ *
+ * We do this by aligning to the larger of the two values (we know both
+ * are power-of-two values, so the larger value is automatically a
+ * multiple of the lesser one).
+ *
+ * XXX Maybe there's a way to use less alignment?
+ *
+ * XXX Maybe with (mem_page_size > PG_IO_ALIGN_SIZE), we don't need to
+ * align to mem_page_size? Especially for very large huge pages (e.g. 1GB)
+ * that doesn't seem quite worth it. Maybe we should simply align to
+ * BLCKSZ, so that buffers don't get split? Still, we might interfere with
+ * other stuff stored in shared memory that we want to allocate on a
+ * particular NUMA node (e.g. ProcArray).
+ *
+ * XXX Maybe with "too large" huge pages we should just not do this, or
+ * maybe do this only for sufficiently large areas (e.g. shared buffers,
+ * but not ProcArray).
+ */
+ buffer_align = Max(mem_page_size, PG_IO_ALIGN_SIZE);
+
+ /* one page is a multiple of the other */
+ Assert(((mem_page_size % PG_IO_ALIGN_SIZE) == 0) ||
+ ((PG_IO_ALIGN_SIZE % mem_page_size) == 0));
- /* Align descriptors to a cacheline boundary. */
+ /*
+ * Align descriptors to a cacheline boundary, and memory page.
+ *
+ * We want to distribute both to NUMA nodes, so that each buffer and its
+ * descriptor are on the same NUMA node. So we align both the same way.
+ *
+ * XXX The memory page is always larger than cacheline, so the cacheline
+ * reference is a bit unnecessary.
+ *
+ * XXX In principle we only need to do this with NUMA, otherwise we could
+ * still align just to cacheline, as before.
+ */
BufferDescriptors = (BufferDescPadded *)
- ShmemInitStruct("Buffer Descriptors",
- NBuffers * sizeof(BufferDescPadded),
- &foundDescs);
+ TYPEALIGN(buffer_align,
+ ShmemInitStruct("Buffer Descriptors",
+ NBuffers * sizeof(BufferDescPadded) + buffer_align,
+ &foundDescs));
/* Align buffer pool on IO page size boundary. */
BufferBlocks = (char *)
- TYPEALIGN(PG_IO_ALIGN_SIZE,
+ TYPEALIGN(buffer_align,
ShmemInitStruct("Buffer Blocks",
- NBuffers * (Size) BLCKSZ + PG_IO_ALIGN_SIZE,
+ NBuffers * (Size) BLCKSZ + buffer_align,
&foundBufs));
/* Align condition variables to cacheline boundary. */
@@ -112,6 +195,63 @@ BufferManagerShmemInit(void)
{
int i;
+ /*
+ * Assign chunks of buffers and buffer descriptors to the available
+ * NUMA nodes. We can't use the regular interleaving, because with
+ * regular memory pages (smaller than BLCKSZ) we'd split all buffers
+ * to multiple NUMA nodes. And we don't want that.
+ *
+ * But even with huge pages it seems like a good idea to not have
+ * mapping for each page.
+ *
+ * So we always assign a larger contiguous chunk of buffers to the
+ * same NUMA node, as calculated by choose_chunk_buffers(). We try to
+ * keep the chunks large enough to work both for buffers and buffer
+ * descriptors, but not too large. See the comments at
+ * choose_chunk_buffers() for details.
+ *
+ * Thanks to the earlier alignment (to memory page etc.), we know the
+ * buffers won't get split, etc.
+ *
+ * This also makes it easier / straightforward to calculate which NUMA
+ * node a buffer belongs to (it's a matter of divide + mod). See
+ * BufferGetNode().
+ */
+ if (numa_buffers_interleave)
+ {
+ char *startptr,
+ *endptr;
+ Size chunk_size;
+
+ numa_nodes = numa_num_configured_nodes();
+
+ numa_chunk_buffers
+ = choose_chunk_buffers(NBuffers, mem_page_size, numa_nodes);
+
+ elog(LOG, "BufferManagerShmemInit num_nodes %d chunk_buffers %ld",
+ numa_nodes, numa_chunk_buffers);
+
+ /* first map buffers */
+ startptr = BufferBlocks;
+ endptr = startptr + ((Size) NBuffers) * BLCKSZ;
+ chunk_size = (numa_chunk_buffers * BLCKSZ);
+
+ pg_numa_interleave_memory(startptr, endptr,
+ mem_page_size,
+ chunk_size,
+ numa_nodes);
+
+ /* now do the same for buffer descriptors */
+ startptr = (char *) BufferDescriptors;
+ endptr = startptr + ((Size) NBuffers) * sizeof(BufferDescPadded);
+ chunk_size = (numa_chunk_buffers * sizeof(BufferDescPadded));
+
+ pg_numa_interleave_memory(startptr, endptr,
+ mem_page_size,
+ chunk_size,
+ numa_nodes);
+ }
+
/*
* Initialize all the buffer headers.
*/
@@ -144,6 +284,11 @@ BufferManagerShmemInit(void)
GetBufferDescriptor(NBuffers - 1)->freeNext = FREENEXT_END_OF_LIST;
}
+ /*
+ * At this point we have all the buffers in a single long freelist. With
+ * freelist partitioning we rebuild them in StrategyInitialize.
+ */
+
/* Init other shared buffer-management stuff */
StrategyInitialize(!foundDescs);
@@ -152,24 +297,72 @@ BufferManagerShmemInit(void)
&backend_flush_after);
}
+/*
+ * Determine the size of memory page.
+ *
+ * XXX This is a bit tricky, because the result depends at which point we call
+ * this. Before the allocation we don't know if we succeed in allocating huge
+ * pages - but we have to size everything for the chance that we will. And then
+ * if the huge pages fail (with 'huge_pages=try'), we'll use the regular memory
+ * pages. But at that point we can't adjust the sizing.
+ *
+ * XXX Maybe with huge_pages=try we should do the sizing twice - first with
+ * huge pages, and if that fails, then without them. But not for this patch.
+ * Up to this point there was no such dependency on huge pages.
+ */
+static Size
+get_memory_page_size(void)
+{
+ Size os_page_size;
+ Size huge_page_size;
+
+#ifdef WIN32
+ SYSTEM_INFO sysinfo;
+
+ GetSystemInfo(&sysinfo);
+ os_page_size = sysinfo.dwPageSize;
+#else
+ os_page_size = sysconf(_SC_PAGESIZE);
+#endif
+
+ /* assume huge pages get used, unless HUGE_PAGES_OFF */
+ if (huge_pages_status != HUGE_PAGES_OFF)
+ GetHugePageSize(&huge_page_size, NULL);
+ else
+ huge_page_size = 0;
+
+ return Max(os_page_size, huge_page_size);
+}
+
/*
* BufferManagerShmemSize
*
* compute the size of shared memory for the buffer pool including
* data pages, buffer descriptors, hash tables, etc.
+ *
+ * XXX Called before allocation, so we don't know if huge pages get used yet.
+ * So we need to assume huge pages get used, and use get_memory_page_size()
+ * to calculate the largest possible memory page.
*/
Size
BufferManagerShmemSize(void)
{
Size size = 0;
+ Size mem_page_size;
+
+ /* XXX why does IsUnderPostmaster matter? */
+ if (IsUnderPostmaster)
+ mem_page_size = pg_get_shmem_pagesize();
+ else
+ mem_page_size = get_memory_page_size();
/* size of buffer descriptors */
size = add_size(size, mul_size(NBuffers, sizeof(BufferDescPadded)));
/* to allow aligning buffer descriptors */
- size = add_size(size, PG_CACHE_LINE_SIZE);
+ size = add_size(size, Max(mem_page_size, PG_IO_ALIGN_SIZE));
/* size of data pages, plus alignment padding */
- size = add_size(size, PG_IO_ALIGN_SIZE);
+ size = add_size(size, Max(mem_page_size, PG_IO_ALIGN_SIZE));
size = add_size(size, mul_size(NBuffers, BLCKSZ));
/* size of stuff controlled by freelist.c */
@@ -186,3 +379,178 @@ BufferManagerShmemSize(void)
return size;
}
+
+/*
+ * choose_chunk_buffers
+ * choose the number of buffers allocated to a NUMA node at once
+ *
+ * We don't map shared buffers to NUMA nodes one by one, but in larger chunks.
+ * This is both for efficiency reasons (fewer mappings), and also because we
+ * want to map buffer descriptors too - and descriptors are much smaller. So
+ * we pick a number that's high enough for descriptors to use whole pages.
+ *
+ * We also want to keep buffers somehow evenly distributed on nodes, with
+ * about NBuffers/nodes per node. So we don't use chunks larger than this,
+ * to keep it as fair as possible (the chunk size is a possible difference
+ * between memory allocated to different NUMA nodes).
+ *
+ * It's possible shared buffers are so small this is not possible (i.e.
+ * it's less than chunk_size). But sensible NUMA systems will use a lot
+ * of memory, so this is unlikely.
+ *
+ * We simply print a warning about the misbalance, and that's it.
+ *
+ * XXX It'd be good to ensure the chunk size is a power-of-2, because then
+ * we could calculate the NUMA node simply by shift/modulo, while now we
+ * have to do a division. But we don't know how many buffers and buffer
+ * descriptors fits into a memory page. It may not be a power-of-2.
+ */
+static int64
+choose_chunk_buffers(int NBuffers, Size mem_page_size, int num_nodes)
+{
+ int64 num_items;
+ int64 max_items;
+
+ /* make sure the chunks will align nicely */
+ Assert(BLCKSZ % sizeof(BufferDescPadded) == 0);
+ Assert(mem_page_size % sizeof(BufferDescPadded) == 0);
+ Assert(((BLCKSZ % mem_page_size) == 0) || ((mem_page_size % BLCKSZ) == 0));
+
+ /*
+ * The minimum number of items to fill a memory page with descriptors and
+ * blocks. The NUMA allocates memory in pages, and we need to do that for
+ * both buffers and descriptors.
+ *
+ * In practice the BLCKSZ doesn't really matter, because it's much larger
+ * than BufferDescPadded, so the result is determined by the buffer descriptors.
+ * But it's clearer this way.
+ */
+ num_items = Max(mem_page_size / sizeof(BufferDescPadded),
+ mem_page_size / BLCKSZ);
+
+ /*
+ * We shouldn't use chunks larger than NBuffers/num_nodes, because with
+ * larger chunks the last NUMA node would end up with much less memory (or
+ * no memory at all).
+ */
+ max_items = (NBuffers / num_nodes);
+
+ /*
+ * Did we already exceed the maximum desirable chunk size? That is, will
+ * the last node get less than one whole chunk (or no memory at all)?
+ */
+ if (num_items > max_items)
+ elog(WARNING, "choose_chunk_buffers: chunk items exceeds max (%ld > %ld)",
+ num_items, max_items);
+
+ /* grow the chunk size until we hit the max limit. */
+ while (2 * num_items <= max_items)
+ num_items *= 2;
+
+ /*
+ * XXX It's not difficult to construct cases where we end up with not
+ * quite balanced distribution. For example, with shared_buffers=10GB and
+ * 4 NUMA nodes, we end up with 2GB chunks, which means the first node
+ * gets 4GB, and the three other nodes get 2GB each.
+ *
+ * We could be smarter, and try to get more balanced distribution. We
+ * could simply reduce max_items e.g. to
+ *
+ * max_items = (NBuffers / num_nodes) / 4;
+ *
+ * in which cases we'd end up with 512MB chunks, and each nodes would get
+ * the same 2.5GB chunk. It may not always work out this nicely, but it's
+ * better than with (NBuffers / num_nodes).
+ *
+ * Alternatively, we could "backtrack" - try with the large max_items,
+ * check how balanced it is, and if it's too imbalanced, try with a
+ * smaller one.
+ *
+ * We however want a simple scheme.
+ */
+
+ return num_items;
+}
+
+/*
+ * Calculate the NUMA node for a given buffer.
+ */
+int
+BufferGetNode(Buffer buffer)
+{
+ /* not NUMA interleaving */
+ if (numa_chunk_buffers == -1)
+ return -1;
+
+ return (buffer / numa_chunk_buffers) % numa_nodes;
+}
+
+/*
+ * pg_numa_interleave_memory
+ * move memory to different NUMA nodes in larger chunks
+ *
+ * startptr - start of the region (should be aligned to page size)
+ * endptr - end of the region (doesn't need to be aligned)
+ * mem_page_size - size of the memory page size
+ * chunk_size - size of the chunk to move to a single node (should be multiple
+ * of page size)
+ * num_nodes - number of nodes to allocate memory to
+ *
+ * XXX Maybe this should use numa_tonode_memory and numa_police_memory instead?
+ * That might be more efficient than numa_move_pages, as it works on larger
+ * chunks of memory, not individual system pages, I think.
+ *
+ * XXX The "interleave" name is not quite accurate, I guess.
+ */
+static void
+pg_numa_interleave_memory(char *startptr, char *endptr,
+ Size mem_page_size, Size chunk_size,
+ int num_nodes)
+{
+ volatile uint64 touch pg_attribute_unused();
+ char *ptr = startptr;
+
+ /* chunk size has to be a multiple of memory page */
+ Assert((chunk_size % mem_page_size) == 0);
+
+ /*
+ * Walk the memory pages in the range, and determine the node for each
+ * one. We use numa_tonode_memory(), because then we can move a whole
+ * memory range to the node, we don't need to worry about individual pages
+ * like with numa_move_pages().
+ */
+ while (ptr < endptr)
+ {
+ /* We may have an incomplete chunk at the end. */
+ Size sz = Min(chunk_size, (endptr - ptr));
+
+ /*
+ * What NUMA node does this range belong to? Each chunk should go to
+ * the same NUMA node, in a round-robin manner.
+ */
+ int node = ((ptr - startptr) / chunk_size) % num_nodes;
+
+ /*
+ * Nope, we have the first buffer from the next memory page, and we'll
+ * set NUMA node for it (and all pages up to the next buffer). The
+ * buffer should align with the memory page, thanks to the
+ * buffer_align earlier.
+ */
+ Assert((int64) ptr % mem_page_size == 0);
+ Assert((sz % mem_page_size) == 0);
+
+ /*
+ * XXX no return value, to make this fail on error, has to use
+ * numa_set_strict
+ *
+ * XXX Should we still touch the memory first, like with numa_move_pages,
+ * or is that not necessary?
+ */
+ numa_tonode_memory(ptr, sz, node);
+
+ ptr += sz;
+ }
+
+ /* should have processed all chunks */
+ Assert(ptr == endptr);
+}
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 406ce77693c..e1e1cfd379d 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -685,6 +685,7 @@ ReadRecentBuffer(RelFileLocator rlocator, ForkNumber forkNum, BlockNumber blockN
BufferDesc *bufHdr;
BufferTag tag;
uint32 buf_state;
+
Assert(BufferIsValid(recent_buffer));
ResourceOwnerEnlarge(CurrentResourceOwner);
diff --git a/src/backend/utils/init/globals.c b/src/backend/utils/init/globals.c
index d31cb45a058..876cb64cf66 100644
--- a/src/backend/utils/init/globals.c
+++ b/src/backend/utils/init/globals.c
@@ -145,6 +145,9 @@ int max_worker_processes = 8;
int max_parallel_workers = 8;
int MaxBackends = 0;
+/* NUMA stuff */
+bool numa_buffers_interleave = false;
+
/* GUC parameters for vacuum */
int VacuumBufferUsageLimit = 2048;
diff --git a/src/backend/utils/misc/guc_tables.c b/src/backend/utils/misc/guc_tables.c
index 511dc32d519..198a57e70a5 100644
--- a/src/backend/utils/misc/guc_tables.c
+++ b/src/backend/utils/misc/guc_tables.c
@@ -2116,6 +2116,16 @@ struct config_bool ConfigureNamesBool[] =
NULL, NULL, NULL
},
+ {
+ {"numa_buffers_interleave", PGC_POSTMASTER, DEVELOPER_OPTIONS,
+ gettext_noop("Enables NUMA interleaving of shared buffers."),
+ gettext_noop("When enabled, the buffers in shared memory are interleaved to all NUMA nodes."),
+ },
+ &numa_buffers_interleave,
+ false,
+ NULL, NULL, NULL
+ },
+
{
{"sync_replication_slots", PGC_SIGHUP, REPLICATION_STANDBY,
gettext_noop("Enables a physical standby to synchronize logical failover replication slots from the primary server."),
diff --git a/src/bin/pgbench/pgbench.c b/src/bin/pgbench/pgbench.c
index 69b6a877dc9..c07de903f76 100644
--- a/src/bin/pgbench/pgbench.c
+++ b/src/bin/pgbench/pgbench.c
@@ -305,7 +305,7 @@ static const char *progname;
#define CPU_PINNING_RANDOM 1
#define CPU_PINNING_COLOCATED 2
-static int pinning_mode = CPU_PINNING_NONE;
+static int pinning_mode = CPU_PINNING_NONE;
#define WSEP '@' /* weight separator */
@@ -874,20 +874,20 @@ static bool socket_has_input(socket_set *sa, int fd, int idx);
*/
typedef struct cpu_generator_state
{
- int ncpus; /* number of CPUs available */
- int nitems; /* number of items in the queue */
- int *nthreads; /* number of threads for each CPU */
- int *nclients; /* number of processes for each CPU */
- int *items; /* queue of CPUs to pick from */
-} cpu_generator_state;
+ int ncpus; /* number of CPUs available */
+ int nitems; /* number of items in the queue */
+ int *nthreads; /* number of threads for each CPU */
+ int *nclients; /* number of processes for each CPU */
+ int *items; /* queue of CPUs to pick from */
+} cpu_generator_state;
static cpu_generator_state cpu_generator_init(int ncpus);
-static void cpu_generator_refill(cpu_generator_state *state);
-static void cpu_generator_reset(cpu_generator_state *state);
-static int cpu_generator_thread(cpu_generator_state *state);
-static int cpu_generator_client(cpu_generator_state *state, int thread_cpu);
-static void cpu_generator_print(cpu_generator_state *state);
-static bool cpu_generator_check(cpu_generator_state *state);
+static void cpu_generator_refill(cpu_generator_state * state);
+static void cpu_generator_reset(cpu_generator_state * state);
+static int cpu_generator_thread(cpu_generator_state * state);
+static int cpu_generator_client(cpu_generator_state * state, int thread_cpu);
+static void cpu_generator_print(cpu_generator_state * state);
+static bool cpu_generator_check(cpu_generator_state * state);
static void reset_pinning(TState *threads, int nthreads);
@@ -7422,7 +7422,7 @@ main(int argc, char **argv)
/* try to assign threads/clients to CPUs */
if (pinning_mode != CPU_PINNING_NONE)
{
- int nprocs = get_nprocs();
+ int nprocs = get_nprocs();
cpu_generator_state state = cpu_generator_init(nprocs);
retry:
@@ -7433,6 +7433,7 @@ retry:
for (i = 0; i < nthreads; i++)
{
TState *thread = &threads[i];
+
thread->cpu = cpu_generator_thread(&state);
}
@@ -7444,7 +7445,7 @@ retry:
while (true)
{
/* did we find any unassigned backend? */
- bool found = false;
+ bool found = false;
for (i = 0; i < nthreads; i++)
{
@@ -7678,10 +7679,10 @@ threadRun(void *arg)
/* determine PID of the backend, pin it to the same CPU */
for (int i = 0; i < nstate; i++)
{
- char *pid_str;
- pid_t pid;
+ char *pid_str;
+ pid_t pid;
- PGresult *res = PQexec(state[i].con, "select pg_backend_pid()");
+ PGresult *res = PQexec(state[i].con, "select pg_backend_pid()");
if (PQresultStatus(res) != PGRES_TUPLES_OK)
pg_fatal("could not determine PID of the backend for client %d",
@@ -8184,7 +8185,7 @@ cpu_generator_init(int ncpus)
{
struct timeval tv;
- cpu_generator_state state;
+ cpu_generator_state state;
state.ncpus = ncpus;
@@ -8207,7 +8208,7 @@ cpu_generator_init(int ncpus)
}
static void
-cpu_generator_refill(cpu_generator_state *state)
+cpu_generator_refill(cpu_generator_state * state)
{
struct timeval tv;
@@ -8223,7 +8224,7 @@ cpu_generator_refill(cpu_generator_state *state)
}
static void
-cpu_generator_reset(cpu_generator_state *state)
+cpu_generator_reset(cpu_generator_state * state)
{
state->nitems = 0;
cpu_generator_refill(state);
@@ -8236,15 +8237,15 @@ cpu_generator_reset(cpu_generator_state *state)
}
static int
-cpu_generator_thread(cpu_generator_state *state)
+cpu_generator_thread(cpu_generator_state * state)
{
if (state->nitems == 0)
cpu_generator_refill(state);
while (true)
{
- int idx = lrand48() % state->nitems;
- int cpu = state->items[idx];
+ int idx = lrand48() % state->nitems;
+ int cpu = state->items[idx];
state->items[idx] = state->items[state->nitems - 1];
state->nitems--;
@@ -8256,10 +8257,10 @@ cpu_generator_thread(cpu_generator_state *state)
}
static int
-cpu_generator_client(cpu_generator_state *state, int thread_cpu)
+cpu_generator_client(cpu_generator_state * state, int thread_cpu)
{
- int min_clients;
- bool has_valid_cpus = false;
+ int min_clients;
+ bool has_valid_cpus = false;
for (int i = 0; i < state->nitems; i++)
{
@@ -8284,8 +8285,8 @@ cpu_generator_client(cpu_generator_state *state, int thread_cpu)
while (true)
{
- int idx = lrand48() % state->nitems;
- int cpu = state->items[idx];
+ int idx = lrand48() % state->nitems;
+ int cpu = state->items[idx];
if (cpu == thread_cpu)
continue;
@@ -8303,7 +8304,7 @@ cpu_generator_client(cpu_generator_state *state, int thread_cpu)
}
static void
-cpu_generator_print(cpu_generator_state *state)
+cpu_generator_print(cpu_generator_state * state)
{
for (int i = 0; i < state->ncpus; i++)
{
@@ -8312,10 +8313,10 @@ cpu_generator_print(cpu_generator_state *state)
}
static bool
-cpu_generator_check(cpu_generator_state *state)
+cpu_generator_check(cpu_generator_state * state)
{
- int min_count = INT_MAX,
- max_count = 0;
+ int min_count = INT_MAX,
+ max_count = 0;
for (int i = 0; i < state->ncpus; i++)
{
diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h
index 1bef98471c3..014a6079af2 100644
--- a/src/include/miscadmin.h
+++ b/src/include/miscadmin.h
@@ -178,6 +178,8 @@ extern PGDLLIMPORT int MaxConnections;
extern PGDLLIMPORT int max_worker_processes;
extern PGDLLIMPORT int max_parallel_workers;
+extern PGDLLIMPORT bool numa_buffers_interleave;
+
extern PGDLLIMPORT int commit_timestamp_buffers;
extern PGDLLIMPORT int multixact_member_buffers;
extern PGDLLIMPORT int multixact_offset_buffers;
diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index 41fdc1e7693..c257c8a1c20 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -319,6 +319,7 @@ extern void EvictRelUnpinnedBuffers(Relation rel,
/* in buf_init.c */
extern void BufferManagerShmemInit(void);
extern Size BufferManagerShmemSize(void);
+extern int BufferGetNode(Buffer buffer);
/* in localbuf.c */
extern void AtProcExit_LocalBuffers(void);
--
2.49.0
v1-0002-NUMA-localalloc.patch (text/x-patch)
From 6919b1c1c59a6084017ebae5a884bb6c60639364 Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tomas@vondra.me>
Date: Thu, 22 May 2025 18:27:06 +0200
Subject: [PATCH v1 2/6] NUMA: localalloc
Set the default allocation policy to "localalloc", which means from the
local NUMA node. This is useful for process-private memory, which is not
going to be shared with other nodes, and is relatively short-lived (so
we're unlikely to have issues if the process gets moved by scheduler).
This sets default for the whole process, for all future allocations. But
that's fine, we've already populated the shared memory earlier (by
interleaving it explicitly). Otherwise we'd trigger a page fault and it'd
be allocated on the local node.
XXX This patch may not be necessary, as we now locate memory to nodes
using explicit numa_tonode_memory() calls, and not by interleaving. But
it's useful for experiments during development, so I'm keeping it.
---
src/backend/utils/init/globals.c | 1 +
src/backend/utils/init/miscinit.c | 16 ++++++++++++++++
src/backend/utils/misc/guc_tables.c | 10 ++++++++++
src/include/miscadmin.h | 1 +
4 files changed, 28 insertions(+)
diff --git a/src/backend/utils/init/globals.c b/src/backend/utils/init/globals.c
index 876cb64cf66..f5359db3656 100644
--- a/src/backend/utils/init/globals.c
+++ b/src/backend/utils/init/globals.c
@@ -147,6 +147,7 @@ int MaxBackends = 0;
/* NUMA stuff */
bool numa_buffers_interleave = false;
+bool numa_localalloc = false;
/* GUC parameters for vacuum */
int VacuumBufferUsageLimit = 2048;
diff --git a/src/backend/utils/init/miscinit.c b/src/backend/utils/init/miscinit.c
index 43b4dbccc3d..d11936691b2 100644
--- a/src/backend/utils/init/miscinit.c
+++ b/src/backend/utils/init/miscinit.c
@@ -28,6 +28,10 @@
#include <arpa/inet.h>
#include <utime.h>
+#ifdef USE_LIBNUMA
+#include <numa.h>
+#endif
+
#include "access/htup_details.h"
#include "access/parallel.h"
#include "catalog/pg_authid.h"
@@ -164,6 +168,18 @@ InitPostmasterChild(void)
(errcode_for_socket_access(),
errmsg_internal("could not set postmaster death monitoring pipe to FD_CLOEXEC mode: %m")));
#endif
+
+#ifdef USE_LIBNUMA
+ /*
+ * Set the default allocation policy to local node, where the task is
+ * executing at the time of a page fault.
+ *
+ * XXX I believe this is not necessary, now that we don't use automatic
+ * interleaving (numa_set_interleave_mask).
+ */
+ if (numa_localalloc)
+ numa_set_localalloc();
+#endif
}
/*
diff --git a/src/backend/utils/misc/guc_tables.c b/src/backend/utils/misc/guc_tables.c
index 198a57e70a5..57f2df7ab74 100644
--- a/src/backend/utils/misc/guc_tables.c
+++ b/src/backend/utils/misc/guc_tables.c
@@ -2126,6 +2126,16 @@ struct config_bool ConfigureNamesBool[] =
NULL, NULL, NULL
},
+ {
+ {"numa_localalloc", PGC_POSTMASTER, DEVELOPER_OPTIONS,
+ gettext_noop("Enables setting the default allocation policy to local node."),
+ gettext_noop("When enabled, allocate from the node where the task is executing."),
+ },
+ &numa_localalloc,
+ false,
+ NULL, NULL, NULL
+ },
+
{
{"sync_replication_slots", PGC_SIGHUP, REPLICATION_STANDBY,
gettext_noop("Enables a physical standby to synchronize logical failover replication slots from the primary server."),
diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h
index 014a6079af2..692871a401f 100644
--- a/src/include/miscadmin.h
+++ b/src/include/miscadmin.h
@@ -179,6 +179,7 @@ extern PGDLLIMPORT int max_worker_processes;
extern PGDLLIMPORT int max_parallel_workers;
extern PGDLLIMPORT bool numa_buffers_interleave;
+extern PGDLLIMPORT bool numa_localalloc;
extern PGDLLIMPORT int commit_timestamp_buffers;
extern PGDLLIMPORT int multixact_member_buffers;
--
2.49.0
v1-0003-freelist-Don-t-track-tail-of-a-freelist.patch (text/x-patch)
From c2b2edb71d629ebe4283b636f058b8e42d1f1a35 Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Mon, 14 Oct 2024 14:10:13 -0400
Subject: [PATCH v1 3/6] freelist: Don't track tail of a freelist
The freelist tail isn't currently used, making it unnecessary overhead.
So just don't do that.
---
src/backend/storage/buffer/freelist.c | 9 ---------
1 file changed, 9 deletions(-)
diff --git a/src/backend/storage/buffer/freelist.c b/src/backend/storage/buffer/freelist.c
index 01909be0272..e046526c149 100644
--- a/src/backend/storage/buffer/freelist.c
+++ b/src/backend/storage/buffer/freelist.c
@@ -40,12 +40,6 @@ typedef struct
pg_atomic_uint32 nextVictimBuffer;
int firstFreeBuffer; /* Head of list of unused buffers */
- int lastFreeBuffer; /* Tail of list of unused buffers */
-
- /*
- * NOTE: lastFreeBuffer is undefined when firstFreeBuffer is -1 (that is,
- * when the list is empty)
- */
/*
* Statistics. These counters should be wide enough that they can't
@@ -371,8 +365,6 @@ StrategyFreeBuffer(BufferDesc *buf)
if (buf->freeNext == FREENEXT_NOT_IN_LIST)
{
buf->freeNext = StrategyControl->firstFreeBuffer;
- if (buf->freeNext < 0)
- StrategyControl->lastFreeBuffer = buf->buf_id;
StrategyControl->firstFreeBuffer = buf->buf_id;
}
@@ -509,7 +501,6 @@ StrategyInitialize(bool init)
* assume it was previously set up by BufferManagerShmemInit().
*/
StrategyControl->firstFreeBuffer = 0;
- StrategyControl->lastFreeBuffer = NBuffers - 1;
/* Initialize the clock sweep pointer */
pg_atomic_init_u32(&StrategyControl->nextVictimBuffer, 0);
--
2.49.0
v1-0004-NUMA-partition-buffer-freelist.patch (text/x-patch)
From 6505848ac8359c8c76dfbffc7150b6601ab07601 Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tomas@vondra.me>
Date: Thu, 22 May 2025 18:38:41 +0200
Subject: [PATCH v1 4/6] NUMA: partition buffer freelist
Instead of a single buffer freelist, partition into multiple smaller
lists, to reduce lock contention, and to spread the buffers over all
NUMA nodes more evenly.
There are four strategies, specified by GUC numa_partition_freelist
* none - single long freelist, should work just like now
* node - one freelist per NUMA node, with only buffers from that node
* cpu - one freelist per CPU
* pid - freelist determined by PID (same number of freelists as 'cpu')
When allocating a buffer, it's taken from the correct freelist (e.g.
same NUMA node).
Note: This is (probably) more important than partitioning ProcArray.
---
src/backend/storage/buffer/buf_init.c | 4 +-
src/backend/storage/buffer/freelist.c | 324 +++++++++++++++++++++++---
src/backend/utils/init/globals.c | 1 +
src/backend/utils/misc/guc_tables.c | 18 ++
src/include/miscadmin.h | 1 +
src/include/storage/bufmgr.h | 8 +
6 files changed, 327 insertions(+), 29 deletions(-)
diff --git a/src/backend/storage/buffer/buf_init.c b/src/backend/storage/buffer/buf_init.c
index 2ad34624c49..920f1a32a8f 100644
--- a/src/backend/storage/buffer/buf_init.c
+++ b/src/backend/storage/buffer/buf_init.c
@@ -543,8 +543,8 @@ pg_numa_interleave_memory(char *startptr, char *endptr,
* XXX no return value, to make this fail on error, has to use
* numa_set_strict
*
- * XXX Should we still touch the memory first, like with numa_move_pages,
- * or is that not necessary?
+ * XXX Should we still touch the memory first, like with
+ * numa_move_pages, or is that not necessary?
*/
numa_tonode_memory(ptr, sz, node);
diff --git a/src/backend/storage/buffer/freelist.c b/src/backend/storage/buffer/freelist.c
index e046526c149..c93ec2841c5 100644
--- a/src/backend/storage/buffer/freelist.c
+++ b/src/backend/storage/buffer/freelist.c
@@ -15,14 +15,41 @@
*/
#include "postgres.h"
+#include <sched.h>
+#include <sys/sysinfo.h>
+
+#ifdef USE_LIBNUMA
+#include <numa.h>
+#include <numaif.h>
+#endif
+
#include "pgstat.h"
#include "port/atomics.h"
#include "storage/buf_internals.h"
#include "storage/bufmgr.h"
+#include "storage/ipc.h"
#include "storage/proc.h"
#define INT_ACCESS_ONCE(var) ((int)(*((volatile int *)&(var))))
+/*
+ * Represents one freelist partition.
+ */
+typedef struct BufferStrategyFreelist
+{
+ /* Spinlock: protects the values below */
+ slock_t freelist_lock;
+
+ /*
+ * XXX Not sure why this needs to be aligned like this. Need to ask
+ * Andres.
+ */
+ int firstFreeBuffer __attribute__((aligned(64))); /* Head of list of
+ * unused buffers */
+
+ /* Number of buffers consumed from this list. */
+ uint64 consumed;
+} BufferStrategyFreelist;
/*
* The shared freelist control information.
@@ -39,8 +66,6 @@ typedef struct
*/
pg_atomic_uint32 nextVictimBuffer;
- int firstFreeBuffer; /* Head of list of unused buffers */
-
/*
* Statistics. These counters should be wide enough that they can't
* overflow during a single bgwriter cycle.
@@ -51,13 +76,27 @@ typedef struct
/*
* Bgworker process to be notified upon activity or -1 if none. See
* StrategyNotifyBgWriter.
+ *
+ * XXX Not sure why this needs to be aligned like this. Need to ask
+ * Andres. Also, shouldn't the alignment be specified after, like for
+ * "consumed"?
*/
- int bgwprocno;
+ int __attribute__((aligned(64))) bgwprocno;
+
+ BufferStrategyFreelist freelists[FLEXIBLE_ARRAY_MEMBER];
} BufferStrategyControl;
/* Pointers to shared state */
static BufferStrategyControl *StrategyControl = NULL;
+/*
+ * XXX shouldn't this be in BufferStrategyControl? Probably not, we need to
+ * calculate it during sizing, and perhaps it could change before the memory
+ * gets allocated (so we need to remember the values).
+ */
+static int strategy_nnodes;
+static int strategy_ncpus;
+
/*
* Private (non-shared) state for managing a ring of shared buffers to re-use.
* This is currently the only kind of BufferAccessStrategy object, but someday
@@ -157,6 +196,90 @@ ClockSweepTick(void)
return victim;
}
+/*
+ * ChooseFreeList
+ * Pick the buffer freelist to use, depending on the CPU and NUMA node.
+ *
+ * Without partitioned freelists (numa_partition_freelist=false), there's only
+ * a single freelist, so use that.
+ *
+ * With partitioned freelists, we have multiple ways how to pick the freelist
+ * for the backend:
+ *
+ * - one freelist per CPU, use the freelist for CPU the task executes on
+ *
+ * - one freelist per NUMA node, use the freelist for node task executes on
+ *
+ * - use fixed number of freelists, map processes to lists based on PID
+ *
+ * There may be some other strategies, not sure. The important thing is this
+ * needs to be reflected during initialization, i.e. we need to create the
+ * right number of lists.
+ */
+static BufferStrategyFreelist *
+ChooseFreeList(void)
+{
+ unsigned cpu;
+ unsigned node;
+ int rc;
+
+ int freelist_idx;
+
+ /* freelist not partitioned, return the first (and only) freelist */
+ if (numa_partition_freelist == FREELIST_PARTITION_NONE)
+ return &StrategyControl->freelists[0];
+
+ /*
+ * freelist is partitioned, so determine the CPU/NUMA node, and pick a
+ * list based on that.
+ */
+ rc = getcpu(&cpu, &node);
+ if (rc != 0)
+ elog(ERROR, "getcpu failed: %m");
+
+ /*
+ * FIXME This doesn't work well if CPUs are excluded from being run or
+ * offline. In that case we end up not using some freelists at all, but
+ * not sure if we need to worry about that. Probably not for now. But
+ * could that change while the system is running?
+ *
+ * XXX Maybe we should somehow detect changes to the list of CPUs, and
+ * rebuild the lists if that changes? But that seems expensive.
+ */
+ if (cpu > strategy_ncpus)
+ elog(ERROR, "cpu out of range: %d > %u", cpu, strategy_ncpus);
+ else if (node > strategy_nnodes)
+ elog(ERROR, "node out of range: %d > %u", cpu, strategy_nnodes);
+
+ /*
+ * Pick the freelist, based on CPU, NUMA node or process PID. This matches
+ * how we built the freelists above.
+ *
+ * XXX Can we rely on some of the values (especially strategy_nnodes) to
+ * be a power-of-2? Then we could replace the modulo with a mask, which is
+ * likely more efficient.
+ */
+ switch (numa_partition_freelist)
+ {
+ case FREELIST_PARTITION_CPU:
+ freelist_idx = cpu % strategy_ncpus;
+ break;
+
+ case FREELIST_PARTITION_NODE:
+ freelist_idx = node % strategy_nnodes;
+ break;
+
+ case FREELIST_PARTITION_PID:
+ freelist_idx = MyProcPid % strategy_ncpus;
+ break;
+
+ default:
+ elog(ERROR, "unknown freelist partitioning value");
+ }
+
+ return &StrategyControl->freelists[freelist_idx];
+}
+
/*
* have_free_buffer -- a lockless check to see if there is a free buffer in
* buffer pool.
@@ -168,10 +291,13 @@ ClockSweepTick(void)
bool
have_free_buffer(void)
{
- if (StrategyControl->firstFreeBuffer >= 0)
- return true;
- else
- return false;
+ for (int i = 0; i < strategy_ncpus; i++)
+ {
+ if (StrategyControl->freelists[i].firstFreeBuffer >= 0)
+ return true;
+ }
+
+ return false;
}
/*
@@ -193,6 +319,7 @@ StrategyGetBuffer(BufferAccessStrategy strategy, uint32 *buf_state, bool *from_r
int bgwprocno;
int trycounter;
uint32 local_buf_state; /* to avoid repeated (de-)referencing */
+ BufferStrategyFreelist *freelist;
*from_ring = false;
@@ -259,31 +386,35 @@ StrategyGetBuffer(BufferAccessStrategy strategy, uint32 *buf_state, bool *from_r
* buffer_strategy_lock not the individual buffer spinlocks, so it's OK to
* manipulate them without holding the spinlock.
*/
- if (StrategyControl->firstFreeBuffer >= 0)
+ freelist = ChooseFreeList();
+ if (freelist->firstFreeBuffer >= 0)
{
while (true)
{
/* Acquire the spinlock to remove element from the freelist */
- SpinLockAcquire(&StrategyControl->buffer_strategy_lock);
+ SpinLockAcquire(&freelist->freelist_lock);
- if (StrategyControl->firstFreeBuffer < 0)
+ if (freelist->firstFreeBuffer < 0)
{
- SpinLockRelease(&StrategyControl->buffer_strategy_lock);
+ SpinLockRelease(&freelist->freelist_lock);
break;
}
- buf = GetBufferDescriptor(StrategyControl->firstFreeBuffer);
+ buf = GetBufferDescriptor(freelist->firstFreeBuffer);
Assert(buf->freeNext != FREENEXT_NOT_IN_LIST);
/* Unconditionally remove buffer from freelist */
- StrategyControl->firstFreeBuffer = buf->freeNext;
+ freelist->firstFreeBuffer = buf->freeNext;
buf->freeNext = FREENEXT_NOT_IN_LIST;
+ /* increment number of buffers we consumed from this list */
+ freelist->consumed++;
+
/*
* Release the lock so someone else can access the freelist while
* we check out this buffer.
*/
- SpinLockRelease(&StrategyControl->buffer_strategy_lock);
+ SpinLockRelease(&freelist->freelist_lock);
/*
* If the buffer is pinned or has a nonzero usage_count, we cannot
@@ -305,7 +436,17 @@ StrategyGetBuffer(BufferAccessStrategy strategy, uint32 *buf_state, bool *from_r
}
}
- /* Nothing on the freelist, so run the "clock sweep" algorithm */
+ /*
+ * Nothing on the freelist, so run the "clock sweep" algorithm
+ *
+ * XXX Should we also make this NUMA-aware, to only access buffers from
+ * the same NUMA node? That'd probably mean we need to make the clock
+ * sweep NUMA-aware, perhaps by having multiple clock sweeps, each for a
+ * subset of buffers. But that also means each process could "sweep" only
+ * a fraction of buffers, even if the other buffers are better candidates
+ * for eviction. Would that also mean we'd have multiple bgwriters, one
+ * for each node, or would one bgwriter handle all of that?
+ */
trycounter = NBuffers;
for (;;)
{
@@ -352,11 +493,22 @@ StrategyGetBuffer(BufferAccessStrategy strategy, uint32 *buf_state, bool *from_r
/*
* StrategyFreeBuffer: put a buffer on the freelist
+ *
+ * XXX This calls ChooseFreeList() again, and it might return the freelist to
+ * a different freelist than it was taken from (either by a different backend,
+ * or perhaps even the same backend running on a different CPU). Is that good?
+ * Maybe we should try to balance this somehow, e.g. by choosing a random list,
+ * the shortest one, or something like that? But that breaks the whole idea of
+ * having freelists with buffers from a particular NUMA node.
*/
void
StrategyFreeBuffer(BufferDesc *buf)
{
- SpinLockAcquire(&StrategyControl->buffer_strategy_lock);
+ BufferStrategyFreelist *freelist;
+
+ freelist = ChooseFreeList();
+
+ SpinLockAcquire(&freelist->freelist_lock);
/*
* It is possible that we are told to put something in the freelist that
@@ -364,11 +516,11 @@ StrategyFreeBuffer(BufferDesc *buf)
*/
if (buf->freeNext == FREENEXT_NOT_IN_LIST)
{
- buf->freeNext = StrategyControl->firstFreeBuffer;
- StrategyControl->firstFreeBuffer = buf->buf_id;
+ buf->freeNext = freelist->firstFreeBuffer;
+ freelist->firstFreeBuffer = buf->buf_id;
}
- SpinLockRelease(&StrategyControl->buffer_strategy_lock);
+ SpinLockRelease(&freelist->freelist_lock);
}
/*
@@ -432,6 +584,42 @@ StrategyNotifyBgWriter(int bgwprocno)
SpinLockRelease(&StrategyControl->buffer_strategy_lock);
}
+/* prints some debug info / stats about freelists at shutdown */
+static void
+freelist_before_shmem_exit(int code, Datum arg)
+{
+ for (int node = 0; node < strategy_ncpus; node++)
+ {
+ BufferStrategyFreelist *freelist = &StrategyControl->freelists[node];
+ uint64 remain = 0;
+ uint64 actually_free = 0;
+ int cur = freelist->firstFreeBuffer;
+
+ while (cur >= 0)
+ {
+ uint32 local_buf_state;
+ BufferDesc *buf;
+
+ buf = GetBufferDescriptor(cur);
+
+ remain++;
+
+ local_buf_state = LockBufHdr(buf);
+
+ if (!(local_buf_state & BM_TAG_VALID))
+ actually_free++;
+
+ UnlockBufHdr(buf, local_buf_state);
+
+ cur = buf->freeNext;
+ }
+ elog(LOG, "freelist %d, firstF: %d: consumed: %lu, remain: %lu, actually free: %lu",
+ node,
+ freelist->firstFreeBuffer,
+ freelist->consumed,
+ remain, actually_free);
+ }
+}
/*
* StrategyShmemSize
@@ -446,11 +634,33 @@ StrategyShmemSize(void)
{
Size size = 0;
+ /* FIXME */
+#ifdef USE_LIBNUMA
+ strategy_ncpus = numa_num_task_cpus();
+ strategy_nnodes = numa_num_task_nodes();
+#else
+ strategy_ncpus = 1;
+ strategy_nnodes = 1;
+#endif
+
+ Assert(strategy_nnodes <= strategy_ncpus);
+
/* size of lookup hash table ... see comment in StrategyInitialize */
size = add_size(size, BufTableShmemSize(NBuffers + NUM_BUFFER_PARTITIONS));
/* size of the shared replacement strategy control block */
- size = add_size(size, MAXALIGN(sizeof(BufferStrategyControl)));
+ size = add_size(size, MAXALIGN(offsetof(BufferStrategyControl, freelists)));
+
+ /*
+ * Allocate one freelist per CPU. We might use per-node freelists, but the
+ * assumption is the number of NUMA nodes is less than the number of CPUs.
+ *
+ * FIXME This assumes that we have more CPUs than NUMA nodes, which seems
+ * like a safe assumption. But maybe we should calculate how many elements
+ * we actually need, depending on the GUC? Not a huge amount of memory.
+ */
+ size = add_size(size, MAXALIGN(mul_size(sizeof(BufferStrategyFreelist),
+ strategy_ncpus)));
return size;
}
@@ -466,6 +676,7 @@ void
StrategyInitialize(bool init)
{
bool found;
+ int buffers_per_cpu;
/*
* Initialize the shared buffer lookup hashtable.
@@ -484,23 +695,27 @@ StrategyInitialize(bool init)
*/
StrategyControl = (BufferStrategyControl *)
ShmemInitStruct("Buffer Strategy Status",
- sizeof(BufferStrategyControl),
+ offsetof(BufferStrategyControl, freelists) +
+ (sizeof(BufferStrategyFreelist) * strategy_ncpus),
&found);
if (!found)
{
+ /*
+ * XXX Calling get_nprocs() may not be quite correct, because some of
+ * the processors may get disabled, etc.
+ */
+ int num_cpus = get_nprocs();
+
/*
* Only done once, usually in postmaster
*/
Assert(init);
- SpinLockInit(&StrategyControl->buffer_strategy_lock);
+ /* register callback to dump some stats on exit */
+ before_shmem_exit(freelist_before_shmem_exit, 0);
- /*
- * Grab the whole linked list of free buffers for our strategy. We
- * assume it was previously set up by BufferManagerShmemInit().
- */
- StrategyControl->firstFreeBuffer = 0;
+ SpinLockInit(&StrategyControl->buffer_strategy_lock);
/* Initialize the clock sweep pointer */
pg_atomic_init_u32(&StrategyControl->nextVictimBuffer, 0);
@@ -511,6 +726,61 @@ StrategyInitialize(bool init)
/* No pending notification */
StrategyControl->bgwprocno = -1;
+
+ /*
+ * Rebuild the freelist - right now all buffers are in one huge list,
+ * we want to rework that into multiple lists. Start by initializing
+ * the strategy to have empty lists.
+ */
+ for (int nfreelist = 0; nfreelist < strategy_ncpus; nfreelist++)
+ {
+ BufferStrategyFreelist *freelist;
+
+ freelist = &StrategyControl->freelists[nfreelist];
+
+ freelist->firstFreeBuffer = FREENEXT_END_OF_LIST;
+
+ SpinLockInit(&freelist->freelist_lock);
+ }
+
+ /* buffers per CPU (also used for PID partitioning) */
+ buffers_per_cpu = (NBuffers / strategy_ncpus);
+
+ elog(LOG, "NBuffers: %d, nodes %d, ncpus: %d, divide: %d, remain: %d",
+ NBuffers, strategy_nnodes, strategy_ncpus,
+ buffers_per_cpu, NBuffers - (strategy_ncpus * buffers_per_cpu));
+
+ /*
+ * Walk through the buffers, add them to the correct list. Walk from
+ * the end, because we're adding the buffers to the beginning.
+ */
+ for (int i = NBuffers - 1; i >= 0; i--)
+ {
+ BufferDesc *buf = GetBufferDescriptor(i);
+ BufferStrategyFreelist *freelist;
+ int belongs_to = 0; /* first freelist by default */
+
+ /*
+ * Split the freelist into partitions, if needed (or just keep the
+ * freelist we already built in BufferManagerShmemInit()).
+ */
+ if ((numa_partition_freelist == FREELIST_PARTITION_CPU) ||
+ (numa_partition_freelist == FREELIST_PARTITION_PID))
+ {
+ belongs_to = (i % num_cpus);
+ }
+ else if (numa_partition_freelist == FREELIST_PARTITION_NODE)
+ {
+ /* determine NUMA node for buffer */
+ belongs_to = BufferGetNode(i);
+ }
+
+ /* add to the right freelist */
+ freelist = &StrategyControl->freelists[belongs_to];
+
+ buf->freeNext = freelist->firstFreeBuffer;
+ freelist->firstFreeBuffer = i;
+ }
}
else
Assert(!init);
diff --git a/src/backend/utils/init/globals.c b/src/backend/utils/init/globals.c
index f5359db3656..7febf3001a3 100644
--- a/src/backend/utils/init/globals.c
+++ b/src/backend/utils/init/globals.c
@@ -148,6 +148,7 @@ int MaxBackends = 0;
/* NUMA stuff */
bool numa_buffers_interleave = false;
bool numa_localalloc = false;
+int numa_partition_freelist = 0;
/* GUC parameters for vacuum */
int VacuumBufferUsageLimit = 2048;
diff --git a/src/backend/utils/misc/guc_tables.c b/src/backend/utils/misc/guc_tables.c
index 57f2df7ab74..e2361c161e6 100644
--- a/src/backend/utils/misc/guc_tables.c
+++ b/src/backend/utils/misc/guc_tables.c
@@ -491,6 +491,14 @@ static const struct config_enum_entry file_copy_method_options[] = {
{NULL, 0, false}
};
+static const struct config_enum_entry freelist_partition_options[] = {
+ {"none", FREELIST_PARTITION_NONE, false},
+ {"node", FREELIST_PARTITION_NODE, false},
+ {"cpu", FREELIST_PARTITION_CPU, false},
+ {"pid", FREELIST_PARTITION_PID, false},
+ {NULL, 0, false}
+};
+
/*
* Options for enum values stored in other modules
*/
@@ -5284,6 +5292,16 @@ struct config_enum ConfigureNamesEnum[] =
NULL, NULL, NULL
},
+ {
+ {"numa_partition_freelist", PGC_USERSET, DEVELOPER_OPTIONS,
+ gettext_noop("Enables buffer freelists to be partitioned per NUMA node."),
+ gettext_noop("When enabled, we create a separate freelist per NUMA node."),
+ },
+ &numa_partition_freelist,
+ FREELIST_PARTITION_NONE, freelist_partition_options,
+ NULL, NULL, NULL
+ },
+
{
{"wal_sync_method", PGC_SIGHUP, WAL_SETTINGS,
gettext_noop("Selects the method used for forcing WAL updates to disk."),
diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h
index 692871a401f..17528439f07 100644
--- a/src/include/miscadmin.h
+++ b/src/include/miscadmin.h
@@ -180,6 +180,7 @@ extern PGDLLIMPORT int max_parallel_workers;
extern PGDLLIMPORT bool numa_buffers_interleave;
extern PGDLLIMPORT bool numa_localalloc;
+extern PGDLLIMPORT int numa_partition_freelist;
extern PGDLLIMPORT int commit_timestamp_buffers;
extern PGDLLIMPORT int multixact_member_buffers;
diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index c257c8a1c20..efb7e28c10f 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -93,6 +93,14 @@ typedef enum ExtendBufferedFlags
EB_LOCK_TARGET = (1 << 5),
} ExtendBufferedFlags;
+typedef enum FreelistPartitionMode
+{
+ FREELIST_PARTITION_NONE,
+ FREELIST_PARTITION_NODE,
+ FREELIST_PARTITION_CPU,
+ FREELIST_PARTITION_PID,
+} FreelistPartitionMode;
+
/*
* Some functions identify relations either by relation or smgr +
* relpersistence. Used via the BMR_REL()/BMR_SMGR() macros below. This
--
2.49.0
v1-0005-NUMA-interleave-PGPROC-entries.patch
From 05c594ed8eb8a266a74038c3131d12bb03d897e3 Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tomas@vondra.me>
Date: Thu, 22 May 2025 18:39:08 +0200
Subject: [PATCH v1 5/6] NUMA: interleave PGPROC entries
The goal is to distribute ProcArray (or rather PGPROC entries and
associated fast-path arrays) to NUMA nodes.
We can't do this by simply interleaving pages, because that wouldn't
work for both parts at the same time. We want to place the PGPROC and
it's fast-path locking structs on the same node, but the structs are
of different sizes, etc.
Another problem is that PGPROC entries are fairly small, so with huge
pages and reasonable values of max_connections everything fits onto a
single page. We don't want to make this incompatible with huge pages.
Note: If we eventually switch to allocating separate shared segments for
different parts (to allow on-line resizing), we could keep using regular
pages for procarray, and this would not be such an issue.
To make this work, we split the PGPROC array into per-node segments,
each with about (MaxBackends / numa_nodes) entries, and one segment for
auxiliary processes and prepared transactions. And we do the same thing
for fast-path arrays.
The PGPROC segments are laid out like this (e.g. for 2 NUMA nodes):
- PGPROC array / node #0
- PGPROC array / node #1
- PGPROC array / aux processes + 2PC transactions
- fast-path arrays / node #0
- fast-path arrays / node #1
- fast-path arrays / aux processes + 2PC transaction
Each segment is aligned to (starts at) a memory page boundary, and its size
is effectively a multiple of the memory page size.
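As a rough illustration of the layout arithmetic (an illustrative sketch only,
assuming MaxBackends, the node count and the memory page size are known; it
mirrors what InitProcGlobal() does in the diff below):

    /* ceiling division: PGPROC entries per NUMA node */
    int   procs_per_node = (MaxBackends + (numa_nodes - 1)) / numa_nodes;
    char *ptr = base;           /* start of the shared allocation */

    for (int node = 0; node < numa_nodes; node++)
    {
        /* the last node may get fewer entries */
        int   count = Min(procs_per_node, MaxBackends - node * procs_per_node);

        /* each per-node chunk starts on a memory-page boundary */
        ptr = (char *) TYPEALIGN(mem_page_size, ptr);

        /* PGPROC entries for this node live in [ptr, ptr + count * sizeof(PGPROC)) */
        ptr += count * sizeof(PGPROC);
    }

    /* one more page-aligned chunk for aux processes + prepared xacts */
    ptr = (char *) TYPEALIGN(mem_page_size, ptr);
    ptr += (NUM_AUXILIARY_PROCS + max_prepared_xacts) * sizeof(PGPROC);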
Having a single PGPROC array made certain operations easier - e.g. it
was possible to iterate the array, and GetNumberFromPGProc() could
calculate offset by simply subtracting PGPROC pointers. With multiple
segments that's not possible, but the fallout is minimal.
Most places accessed PGPROC through PROC_HDR->allProcs, and can continue
to do so, except that now they get a pointer to the PGPROC (which most
places wanted anyway).
Note: There's an extra indirection now, but the pointer does not change,
so hopefully that's not an issue. Each PGPROC entry also gets an explicit
procnumber field with its index in allProcs, so GetNumberFromPGProc can
simply return that.
Each PGPROC also gets numa_node, tracking the NUMA node, so that we
don't have to recalculate that. This is used by InitProcess() to pick
a PGPROC entry from the local NUMA node.
Note: The scheduler may migrate the process to a different CPU/node
later. Maybe we should consider pinning the process to the node?
---
src/backend/access/transam/clog.c | 4 +-
src/backend/postmaster/pgarch.c | 2 +-
src/backend/postmaster/walsummarizer.c | 2 +-
src/backend/storage/buffer/freelist.c | 2 +-
src/backend/storage/ipc/procarray.c | 62 ++---
src/backend/storage/lmgr/lock.c | 6 +-
src/backend/storage/lmgr/proc.c | 368 +++++++++++++++++++++++--
src/backend/utils/init/globals.c | 1 +
src/backend/utils/misc/guc_tables.c | 10 +
src/include/miscadmin.h | 1 +
src/include/storage/proc.h | 11 +-
11 files changed, 407 insertions(+), 62 deletions(-)
diff --git a/src/backend/access/transam/clog.c b/src/backend/access/transam/clog.c
index 48f10bec91e..90ddff37bc6 100644
--- a/src/backend/access/transam/clog.c
+++ b/src/backend/access/transam/clog.c
@@ -576,7 +576,7 @@ TransactionGroupUpdateXidStatus(TransactionId xid, XidStatus status,
/* Walk the list and update the status of all XIDs. */
while (nextidx != INVALID_PROC_NUMBER)
{
- PGPROC *nextproc = &ProcGlobal->allProcs[nextidx];
+ PGPROC *nextproc = ProcGlobal->allProcs[nextidx];
int64 thispageno = nextproc->clogGroupMemberPage;
/*
@@ -635,7 +635,7 @@ TransactionGroupUpdateXidStatus(TransactionId xid, XidStatus status,
*/
while (wakeidx != INVALID_PROC_NUMBER)
{
- PGPROC *wakeproc = &ProcGlobal->allProcs[wakeidx];
+ PGPROC *wakeproc = ProcGlobal->allProcs[wakeidx];
wakeidx = pg_atomic_read_u32(&wakeproc->clogGroupNext);
pg_atomic_write_u32(&wakeproc->clogGroupNext, INVALID_PROC_NUMBER);
diff --git a/src/backend/postmaster/pgarch.c b/src/backend/postmaster/pgarch.c
index 7e622ae4bd2..75c0e4bf53c 100644
--- a/src/backend/postmaster/pgarch.c
+++ b/src/backend/postmaster/pgarch.c
@@ -289,7 +289,7 @@ PgArchWakeup(void)
* be relaunched shortly and will start archiving.
*/
if (arch_pgprocno != INVALID_PROC_NUMBER)
- SetLatch(&ProcGlobal->allProcs[arch_pgprocno].procLatch);
+ SetLatch(&ProcGlobal->allProcs[arch_pgprocno]->procLatch);
}
diff --git a/src/backend/postmaster/walsummarizer.c b/src/backend/postmaster/walsummarizer.c
index 0fec4f1f871..0044ef54363 100644
--- a/src/backend/postmaster/walsummarizer.c
+++ b/src/backend/postmaster/walsummarizer.c
@@ -649,7 +649,7 @@ WakeupWalSummarizer(void)
LWLockRelease(WALSummarizerLock);
if (pgprocno != INVALID_PROC_NUMBER)
- SetLatch(&ProcGlobal->allProcs[pgprocno].procLatch);
+ SetLatch(&ProcGlobal->allProcs[pgprocno]->procLatch);
}
/*
diff --git a/src/backend/storage/buffer/freelist.c b/src/backend/storage/buffer/freelist.c
index c93ec2841c5..4e390a77a71 100644
--- a/src/backend/storage/buffer/freelist.c
+++ b/src/backend/storage/buffer/freelist.c
@@ -360,7 +360,7 @@ StrategyGetBuffer(BufferAccessStrategy strategy, uint32 *buf_state, bool *from_r
* actually fine because procLatch isn't ever freed, so we just can
* potentially set the wrong process' (or no process') latch.
*/
- SetLatch(&ProcGlobal->allProcs[bgwprocno].procLatch);
+ SetLatch(&ProcGlobal->allProcs[bgwprocno]->procLatch);
}
/*
diff --git a/src/backend/storage/ipc/procarray.c b/src/backend/storage/ipc/procarray.c
index e5b945a9ee3..3277480fbcf 100644
--- a/src/backend/storage/ipc/procarray.c
+++ b/src/backend/storage/ipc/procarray.c
@@ -268,7 +268,7 @@ typedef enum KAXCompressReason
static ProcArrayStruct *procArray;
-static PGPROC *allProcs;
+static PGPROC **allProcs;
/*
* Cache to reduce overhead of repeated calls to TransactionIdIsInProgress()
@@ -502,7 +502,7 @@ ProcArrayAdd(PGPROC *proc)
int this_procno = arrayP->pgprocnos[index];
Assert(this_procno >= 0 && this_procno < (arrayP->maxProcs + NUM_AUXILIARY_PROCS));
- Assert(allProcs[this_procno].pgxactoff == index);
+ Assert(allProcs[this_procno]->pgxactoff == index);
/* If we have found our right position in the array, break */
if (this_procno > pgprocno)
@@ -538,9 +538,9 @@ ProcArrayAdd(PGPROC *proc)
int procno = arrayP->pgprocnos[index];
Assert(procno >= 0 && procno < (arrayP->maxProcs + NUM_AUXILIARY_PROCS));
- Assert(allProcs[procno].pgxactoff == index - 1);
+ Assert(allProcs[procno]->pgxactoff == index - 1);
- allProcs[procno].pgxactoff = index;
+ allProcs[procno]->pgxactoff = index;
}
/*
@@ -581,7 +581,7 @@ ProcArrayRemove(PGPROC *proc, TransactionId latestXid)
myoff = proc->pgxactoff;
Assert(myoff >= 0 && myoff < arrayP->numProcs);
- Assert(ProcGlobal->allProcs[arrayP->pgprocnos[myoff]].pgxactoff == myoff);
+ Assert(ProcGlobal->allProcs[arrayP->pgprocnos[myoff]]->pgxactoff == myoff);
if (TransactionIdIsValid(latestXid))
{
@@ -636,9 +636,9 @@ ProcArrayRemove(PGPROC *proc, TransactionId latestXid)
int procno = arrayP->pgprocnos[index];
Assert(procno >= 0 && procno < (arrayP->maxProcs + NUM_AUXILIARY_PROCS));
- Assert(allProcs[procno].pgxactoff - 1 == index);
+ Assert(allProcs[procno]->pgxactoff - 1 == index);
- allProcs[procno].pgxactoff = index;
+ allProcs[procno]->pgxactoff = index;
}
/*
@@ -860,7 +860,7 @@ ProcArrayGroupClearXid(PGPROC *proc, TransactionId latestXid)
/* Walk the list and clear all XIDs. */
while (nextidx != INVALID_PROC_NUMBER)
{
- PGPROC *nextproc = &allProcs[nextidx];
+ PGPROC *nextproc = allProcs[nextidx];
ProcArrayEndTransactionInternal(nextproc, nextproc->procArrayGroupMemberXid);
@@ -880,7 +880,7 @@ ProcArrayGroupClearXid(PGPROC *proc, TransactionId latestXid)
*/
while (wakeidx != INVALID_PROC_NUMBER)
{
- PGPROC *nextproc = &allProcs[wakeidx];
+ PGPROC *nextproc = allProcs[wakeidx];
wakeidx = pg_atomic_read_u32(&nextproc->procArrayGroupNext);
pg_atomic_write_u32(&nextproc->procArrayGroupNext, INVALID_PROC_NUMBER);
@@ -1526,7 +1526,7 @@ TransactionIdIsInProgress(TransactionId xid)
pxids = other_subxidstates[pgxactoff].count;
pg_read_barrier(); /* pairs with barrier in GetNewTransactionId() */
pgprocno = arrayP->pgprocnos[pgxactoff];
- proc = &allProcs[pgprocno];
+ proc = allProcs[pgprocno];
for (j = pxids - 1; j >= 0; j--)
{
/* Fetch xid just once - see GetNewTransactionId */
@@ -1650,7 +1650,7 @@ TransactionIdIsActive(TransactionId xid)
for (i = 0; i < arrayP->numProcs; i++)
{
int pgprocno = arrayP->pgprocnos[i];
- PGPROC *proc = &allProcs[pgprocno];
+ PGPROC *proc = allProcs[pgprocno];
TransactionId pxid;
/* Fetch xid just once - see GetNewTransactionId */
@@ -1792,7 +1792,7 @@ ComputeXidHorizons(ComputeXidHorizonsResult *h)
for (int index = 0; index < arrayP->numProcs; index++)
{
int pgprocno = arrayP->pgprocnos[index];
- PGPROC *proc = &allProcs[pgprocno];
+ PGPROC *proc = allProcs[pgprocno];
int8 statusFlags = ProcGlobal->statusFlags[index];
TransactionId xid;
TransactionId xmin;
@@ -2276,7 +2276,7 @@ GetSnapshotData(Snapshot snapshot)
TransactionId xid = UINT32_ACCESS_ONCE(other_xids[pgxactoff]);
uint8 statusFlags;
- Assert(allProcs[arrayP->pgprocnos[pgxactoff]].pgxactoff == pgxactoff);
+ Assert(allProcs[arrayP->pgprocnos[pgxactoff]]->pgxactoff == pgxactoff);
/*
* If the transaction has no XID assigned, we can skip it; it
@@ -2350,7 +2350,7 @@ GetSnapshotData(Snapshot snapshot)
if (nsubxids > 0)
{
int pgprocno = pgprocnos[pgxactoff];
- PGPROC *proc = &allProcs[pgprocno];
+ PGPROC *proc = allProcs[pgprocno];
pg_read_barrier(); /* pairs with GetNewTransactionId */
@@ -2551,7 +2551,7 @@ ProcArrayInstallImportedXmin(TransactionId xmin,
for (index = 0; index < arrayP->numProcs; index++)
{
int pgprocno = arrayP->pgprocnos[index];
- PGPROC *proc = &allProcs[pgprocno];
+ PGPROC *proc = allProcs[pgprocno];
int statusFlags = ProcGlobal->statusFlags[index];
TransactionId xid;
@@ -2777,7 +2777,7 @@ GetRunningTransactionData(void)
if (TransactionIdPrecedes(xid, oldestDatabaseRunningXid))
{
int pgprocno = arrayP->pgprocnos[index];
- PGPROC *proc = &allProcs[pgprocno];
+ PGPROC *proc = allProcs[pgprocno];
if (proc->databaseId == MyDatabaseId)
oldestDatabaseRunningXid = xid;
@@ -2808,7 +2808,7 @@ GetRunningTransactionData(void)
for (index = 0; index < arrayP->numProcs; index++)
{
int pgprocno = arrayP->pgprocnos[index];
- PGPROC *proc = &allProcs[pgprocno];
+ PGPROC *proc = allProcs[pgprocno];
int nsubxids;
/*
@@ -3058,7 +3058,7 @@ GetVirtualXIDsDelayingChkpt(int *nvxids, int type)
for (index = 0; index < arrayP->numProcs; index++)
{
int pgprocno = arrayP->pgprocnos[index];
- PGPROC *proc = &allProcs[pgprocno];
+ PGPROC *proc = allProcs[pgprocno];
if ((proc->delayChkptFlags & type) != 0)
{
@@ -3099,7 +3099,7 @@ HaveVirtualXIDsDelayingChkpt(VirtualTransactionId *vxids, int nvxids, int type)
for (index = 0; index < arrayP->numProcs; index++)
{
int pgprocno = arrayP->pgprocnos[index];
- PGPROC *proc = &allProcs[pgprocno];
+ PGPROC *proc = allProcs[pgprocno];
VirtualTransactionId vxid;
GET_VXID_FROM_PGPROC(vxid, *proc);
@@ -3227,7 +3227,7 @@ BackendPidGetProcWithLock(int pid)
for (index = 0; index < arrayP->numProcs; index++)
{
- PGPROC *proc = &allProcs[arrayP->pgprocnos[index]];
+ PGPROC *proc = allProcs[arrayP->pgprocnos[index]];
if (proc->pid == pid)
{
@@ -3270,7 +3270,7 @@ BackendXidGetPid(TransactionId xid)
if (other_xids[index] == xid)
{
int pgprocno = arrayP->pgprocnos[index];
- PGPROC *proc = &allProcs[pgprocno];
+ PGPROC *proc = allProcs[pgprocno];
result = proc->pid;
break;
@@ -3339,7 +3339,7 @@ GetCurrentVirtualXIDs(TransactionId limitXmin, bool excludeXmin0,
for (index = 0; index < arrayP->numProcs; index++)
{
int pgprocno = arrayP->pgprocnos[index];
- PGPROC *proc = &allProcs[pgprocno];
+ PGPROC *proc = allProcs[pgprocno];
uint8 statusFlags = ProcGlobal->statusFlags[index];
if (proc == MyProc)
@@ -3441,7 +3441,7 @@ GetConflictingVirtualXIDs(TransactionId limitXmin, Oid dbOid)
for (index = 0; index < arrayP->numProcs; index++)
{
int pgprocno = arrayP->pgprocnos[index];
- PGPROC *proc = &allProcs[pgprocno];
+ PGPROC *proc = allProcs[pgprocno];
/* Exclude prepared transactions */
if (proc->pid == 0)
@@ -3506,7 +3506,7 @@ SignalVirtualTransaction(VirtualTransactionId vxid, ProcSignalReason sigmode,
for (index = 0; index < arrayP->numProcs; index++)
{
int pgprocno = arrayP->pgprocnos[index];
- PGPROC *proc = &allProcs[pgprocno];
+ PGPROC *proc = allProcs[pgprocno];
VirtualTransactionId procvxid;
GET_VXID_FROM_PGPROC(procvxid, *proc);
@@ -3561,7 +3561,7 @@ MinimumActiveBackends(int min)
for (index = 0; index < arrayP->numProcs; index++)
{
int pgprocno = arrayP->pgprocnos[index];
- PGPROC *proc = &allProcs[pgprocno];
+ PGPROC *proc = allProcs[pgprocno];
/*
* Since we're not holding a lock, need to be prepared to deal with
@@ -3607,7 +3607,7 @@ CountDBBackends(Oid databaseid)
for (index = 0; index < arrayP->numProcs; index++)
{
int pgprocno = arrayP->pgprocnos[index];
- PGPROC *proc = &allProcs[pgprocno];
+ PGPROC *proc = allProcs[pgprocno];
if (proc->pid == 0)
continue; /* do not count prepared xacts */
@@ -3636,7 +3636,7 @@ CountDBConnections(Oid databaseid)
for (index = 0; index < arrayP->numProcs; index++)
{
int pgprocno = arrayP->pgprocnos[index];
- PGPROC *proc = &allProcs[pgprocno];
+ PGPROC *proc = allProcs[pgprocno];
if (proc->pid == 0)
continue; /* do not count prepared xacts */
@@ -3667,7 +3667,7 @@ CancelDBBackends(Oid databaseid, ProcSignalReason sigmode, bool conflictPending)
for (index = 0; index < arrayP->numProcs; index++)
{
int pgprocno = arrayP->pgprocnos[index];
- PGPROC *proc = &allProcs[pgprocno];
+ PGPROC *proc = allProcs[pgprocno];
if (databaseid == InvalidOid || proc->databaseId == databaseid)
{
@@ -3708,7 +3708,7 @@ CountUserBackends(Oid roleid)
for (index = 0; index < arrayP->numProcs; index++)
{
int pgprocno = arrayP->pgprocnos[index];
- PGPROC *proc = &allProcs[pgprocno];
+ PGPROC *proc = allProcs[pgprocno];
if (proc->pid == 0)
continue; /* do not count prepared xacts */
@@ -3771,7 +3771,7 @@ CountOtherDBBackends(Oid databaseId, int *nbackends, int *nprepared)
for (index = 0; index < arrayP->numProcs; index++)
{
int pgprocno = arrayP->pgprocnos[index];
- PGPROC *proc = &allProcs[pgprocno];
+ PGPROC *proc = allProcs[pgprocno];
uint8 statusFlags = ProcGlobal->statusFlags[index];
if (proc->databaseId != databaseId)
@@ -3837,7 +3837,7 @@ TerminateOtherDBBackends(Oid databaseId)
for (i = 0; i < procArray->numProcs; i++)
{
int pgprocno = arrayP->pgprocnos[i];
- PGPROC *proc = &allProcs[pgprocno];
+ PGPROC *proc = allProcs[pgprocno];
if (proc->databaseId != databaseId)
continue;
diff --git a/src/backend/storage/lmgr/lock.c b/src/backend/storage/lmgr/lock.c
index 2776ceb295b..95b1da42408 100644
--- a/src/backend/storage/lmgr/lock.c
+++ b/src/backend/storage/lmgr/lock.c
@@ -2844,7 +2844,7 @@ FastPathTransferRelationLocks(LockMethod lockMethodTable, const LOCKTAG *locktag
*/
for (i = 0; i < ProcGlobal->allProcCount; i++)
{
- PGPROC *proc = &ProcGlobal->allProcs[i];
+ PGPROC *proc = ProcGlobal->allProcs[i];
uint32 j;
LWLockAcquire(&proc->fpInfoLock, LW_EXCLUSIVE);
@@ -3103,7 +3103,7 @@ GetLockConflicts(const LOCKTAG *locktag, LOCKMODE lockmode, int *countp)
*/
for (i = 0; i < ProcGlobal->allProcCount; i++)
{
- PGPROC *proc = &ProcGlobal->allProcs[i];
+ PGPROC *proc = ProcGlobal->allProcs[i];
uint32 j;
/* A backend never blocks itself */
@@ -3790,7 +3790,7 @@ GetLockStatusData(void)
*/
for (i = 0; i < ProcGlobal->allProcCount; ++i)
{
- PGPROC *proc = &ProcGlobal->allProcs[i];
+ PGPROC *proc = ProcGlobal->allProcs[i];
/* Skip backends with pid=0, as they don't hold fast-path locks */
if (proc->pid == 0)
diff --git a/src/backend/storage/lmgr/proc.c b/src/backend/storage/lmgr/proc.c
index e9ef0fbfe32..9d3e94a7b3a 100644
--- a/src/backend/storage/lmgr/proc.c
+++ b/src/backend/storage/lmgr/proc.c
@@ -29,21 +29,29 @@
*/
#include "postgres.h"
+#include <sched.h>
#include <signal.h>
#include <unistd.h>
#include <sys/time.h>
+#ifdef USE_LIBNUMA
+#include <numa.h>
+#include <numaif.h>
+#endif
+
#include "access/transam.h"
#include "access/twophase.h"
#include "access/xlogutils.h"
#include "miscadmin.h"
#include "pgstat.h"
+#include "port/pg_numa.h"
#include "postmaster/autovacuum.h"
#include "replication/slotsync.h"
#include "replication/syncrep.h"
#include "storage/condition_variable.h"
#include "storage/ipc.h"
#include "storage/lmgr.h"
+#include "storage/pg_shmem.h"
#include "storage/pmsignal.h"
#include "storage/proc.h"
#include "storage/procarray.h"
@@ -89,6 +97,12 @@ static void ProcKill(int code, Datum arg);
static void AuxiliaryProcKill(int code, Datum arg);
static void CheckDeadLock(void);
+/* NUMA */
+static Size get_memory_page_size(void); /* XXX duplicate */
+static void move_to_node(char *startptr, char *endptr,
+ Size mem_page_size, int node);
+static int numa_nodes = -1;
+
/*
* Report shared-memory space needed by PGPROC.
@@ -100,11 +114,40 @@ PGProcShmemSize(void)
Size TotalProcs =
add_size(MaxBackends, add_size(NUM_AUXILIARY_PROCS, max_prepared_xacts));
+ size = add_size(size, mul_size(TotalProcs, sizeof(PGPROC *)));
size = add_size(size, mul_size(TotalProcs, sizeof(PGPROC)));
size = add_size(size, mul_size(TotalProcs, sizeof(*ProcGlobal->xids)));
size = add_size(size, mul_size(TotalProcs, sizeof(*ProcGlobal->subxidStates)));
size = add_size(size, mul_size(TotalProcs, sizeof(*ProcGlobal->statusFlags)));
+ /*
+ * With NUMA, we allocate the PGPROC array in several chunks. With shared
+ * buffers we simply manually assign parts of the buffer array to
+ * different NUMA nodes, and that does the trick. But we can't do that for
+ * PGPROC, as the number of PGPROC entries is much lower, especially with
+ * huge pages. We can fit ~2k entries on a 2MB page, and NUMA does stuff
+ * with page granularity, and the large NUMA systems are likely to use
+ * huge pages. So with sensible max_connections we would not use more than
+ * a single page, which means it gets to a single NUMA node.
+ *
+ * So we allocate PGPROC not as a single array, but one array per NUMA
+ * node, and then one array for aux processes (without NUMA node
+ * assigned). Each array may need up to memory-page-worth of padding,
+ * worst case. So we just add that - it's a bit wasteful, but good enough
+ * for PoC.
+ *
+ * FIXME Should be conditional, but that was causing problems in bootstrap
+ * mode. Or maybe it was because the code that allocates stuff later does
+ * not do that conditionally. Anyway, needs to be fixed.
+ */
+ /* if (numa_procs_interleave) */
+ {
+ int num_nodes = numa_num_configured_nodes();
+ Size mem_page_size = get_memory_page_size();
+
+ size = add_size(size, mul_size((num_nodes + 1), mem_page_size));
+ }
+
return size;
}
@@ -129,6 +172,26 @@ FastPathLockShmemSize(void)
size = add_size(size, mul_size(TotalProcs, (fpLockBitsSize + fpRelIdSize)));
+ /*
+ * Same NUMA-padding logic as in PGProcShmemSize, adding a memory page per
+ * NUMA node - but this way we add two pages per node - one for PGPROC,
+ * one for fast-path arrays. In theory we could make this work with just one
+ * page per node, by adding fast-path arrays right after PGPROC entries on
+ * each node. But now we allocate fast-path locks separately - good enough
+ * for PoC.
+ *
+ * FIXME Should be conditional, but that was causing problems in bootstrap
+ * mode. Or maybe it was because the code that allocates stuff later does
+ * not do that conditionally. Anyway, needs to be fixed.
+ */
+ /* if (numa_procs_interleave) */
+ {
+ int num_nodes = numa_num_configured_nodes();
+ Size mem_page_size = get_memory_page_size();
+
+ size = add_size(size, mul_size((num_nodes + 1), mem_page_size));
+ }
+
return size;
}
@@ -191,11 +254,13 @@ ProcGlobalSemas(void)
void
InitProcGlobal(void)
{
- PGPROC *procs;
+ PGPROC **procs;
int i,
j;
bool found;
uint32 TotalProcs = MaxBackends + NUM_AUXILIARY_PROCS + max_prepared_xacts;
+ int procs_total;
+ int procs_per_node;
/* Used for setup of per-backend fast-path slots. */
char *fpPtr,
@@ -205,6 +270,8 @@ InitProcGlobal(void)
Size requestSize;
char *ptr;
+ Size mem_page_size = get_memory_page_size();
+
/* Create the ProcGlobal shared structure */
ProcGlobal = (PROC_HDR *)
ShmemInitStruct("Proc Header", sizeof(PROC_HDR), &found);
@@ -224,6 +291,9 @@ InitProcGlobal(void)
pg_atomic_init_u32(&ProcGlobal->procArrayGroupFirst, INVALID_PROC_NUMBER);
pg_atomic_init_u32(&ProcGlobal->clogGroupFirst, INVALID_PROC_NUMBER);
+ /* one chunk per NUMA node (without NUMA assume 1 node) */
+ numa_nodes = numa_num_configured_nodes();
+
/*
* Create and initialize all the PGPROC structures we'll need. There are
* six separate consumers: (1) normal backends, (2) autovacuum workers and
@@ -241,19 +311,108 @@ InitProcGlobal(void)
MemSet(ptr, 0, requestSize);
- procs = (PGPROC *) ptr;
- ptr = (char *) ptr + TotalProcs * sizeof(PGPROC);
+ /* allprocs (array of pointers to PGPROC entries) */
+ procs = (PGPROC **) ptr;
+ ptr = (char *) ptr + TotalProcs * sizeof(PGPROC *);
ProcGlobal->allProcs = procs;
/* XXX allProcCount isn't really all of them; it excludes prepared xacts */
ProcGlobal->allProcCount = MaxBackends + NUM_AUXILIARY_PROCS;
+ /*
+ * NUMA partitioning
+ *
+ * Now build the actual PGPROC arrays, one "chunk" per NUMA node (and one
+ * extra for auxiliary processes and 2PC transactions, not associated with
+ * any particular node).
+ *
+ * First determine how many "backend" procs to allocate per NUMA node. The
+ * count may not be exactly divisible, but we mostly ignore that. The last
+ * node may get somewhat fewer PGPROC entries, but the imbalance ought to
+ * be pretty small (if MaxBackends >> numa_nodes).
+ *
+ * XXX A fairer distribution is possible, but not worth it now.
+ */
+ procs_per_node = (MaxBackends + (numa_nodes - 1)) / numa_nodes;
+ procs_total = 0;
+
+ /* build PGPROC entries for NUMA nodes */
+ for (i = 0; i < numa_nodes; i++)
+ {
+ PGPROC *procs_node;
+
+ /* the last NUMA node may get fewer PGPROC entries, but meh */
+ int count_node = Min(procs_per_node, MaxBackends - procs_total);
+
+ /* make sure to align the PGPROC array to memory page */
+ ptr = (char *) TYPEALIGN(mem_page_size, ptr);
+
+ /* allocate the PGPROC chunk for this node */
+ procs_node = (PGPROC *) ptr;
+ ptr = (char *) ptr + count_node * sizeof(PGPROC);
+
+ /* don't overflow the allocation */
+ Assert((ptr > (char *) procs) && (ptr <= (char *) procs + requestSize));
+
+ /* add pointers to the PGPROC entries to allProcs */
+ for (j = 0; j < count_node; j++)
+ {
+ procs_node[j].numa_node = i;
+ procs_node[j].procnumber = procs_total;
+
+ ProcGlobal->allProcs[procs_total++] = &procs_node[j];
+ }
+
+ move_to_node((char *) procs_node, ptr, mem_page_size, i);
+ }
+
+ /*
+ * also build PGPROC entries for auxiliary procs / prepared xacts (we
+ * don't assign those to any NUMA node)
+ *
+ * XXX Mostly duplicate of preceding block, could be reused.
+ */
+ {
+ PGPROC *procs_node;
+ int count_node = (NUM_AUXILIARY_PROCS + max_prepared_xacts);
+
+ /*
+ * Make sure to align PGPROC array to memory page (it may not be
+ * aligned). We won't assign this to any NUMA node, but we still don't
+ * want it to interfere with the preceding chunk (for the last NUMA
+ * node).
+ */
+ ptr = (char *) TYPEALIGN(mem_page_size, ptr);
+
+ procs_node = (PGPROC *) ptr;
+ ptr = (char *) ptr + count_node * sizeof(PGPROC);
+
+ /* don't overflow the allocation */
+ Assert((ptr > (char *) procs) && (ptr <= (char *) procs + requestSize));
+
+ /* now add the PGPROC pointers to allProcs */
+ for (j = 0; j < count_node; j++)
+ {
+ procs_node[j].numa_node = -1;
+ procs_node[j].procnumber = procs_total;
+
+ ProcGlobal->allProcs[procs_total++] = &procs_node[j];
+ }
+ }
+
+ /* we should have allocated the expected number of PGPROC entries */
+ Assert(procs_total == TotalProcs);
+
/*
* Allocate arrays mirroring PGPROC fields in a dense manner. See
* PROC_HDR.
*
* XXX: It might make sense to increase padding for these arrays, given
* how hotly they are accessed.
+ *
+ * XXX Would it make sense to NUMA-partition these chunks too, somehow?
+ * But those arrays are tiny and fit into a single memory page, so this
+ * would need to be made more complex. Not sure.
*/
ProcGlobal->xids = (TransactionId *) ptr;
ptr = (char *) ptr + (TotalProcs * sizeof(*ProcGlobal->xids));
@@ -286,23 +445,100 @@ InitProcGlobal(void)
/* For asserts checking we did not overflow. */
fpEndPtr = fpPtr + requestSize;
- for (i = 0; i < TotalProcs; i++)
+ /* reset the count */
+ procs_total = 0;
+
+ /*
+ * Mimic the same logic as above, but for fast-path locking.
+ */
+ for (i = 0; i < numa_nodes; i++)
{
- PGPROC *proc = &procs[i];
+ char *startptr;
+ char *endptr;
- /* Common initialization for all PGPROCs, regardless of type. */
+ /* the last NUMA node may get fewer PGPROC entries, but meh */
+ int procs_node = Min(procs_per_node, MaxBackends - procs_total);
+
+ /* align to memory page, to make move_pages possible */
+ fpPtr = (char *) TYPEALIGN(mem_page_size, fpPtr);
+
+ startptr = fpPtr;
+ endptr = fpPtr + procs_node * (fpLockBitsSize + fpRelIdSize);
+
+ move_to_node(startptr, endptr, mem_page_size, i);
/*
- * Set the fast-path lock arrays, and move the pointer. We interleave
- * the two arrays, to (hopefully) get some locality for each backend.
+ * Now point the PGPROC entries to the fast-path arrays, and also
+ * advance the fpPtr.
*/
- proc->fpLockBits = (uint64 *) fpPtr;
- fpPtr += fpLockBitsSize;
+ for (j = 0; j < procs_node; j++)
+ {
+ PGPROC *proc = ProcGlobal->allProcs[procs_total++];
+
+ /* cross-check we got the expected NUMA node */
+ Assert(proc->numa_node == i);
+ Assert(proc->procnumber == (procs_total - 1));
+
+ /*
+ * Set the fast-path lock arrays, and move the pointer. We
+ * interleave the two arrays, to (hopefully) get some locality for
+ * each backend.
+ */
+ proc->fpLockBits = (uint64 *) fpPtr;
+ fpPtr += fpLockBitsSize;
- proc->fpRelId = (Oid *) fpPtr;
- fpPtr += fpRelIdSize;
+ proc->fpRelId = (Oid *) fpPtr;
+ fpPtr += fpRelIdSize;
- Assert(fpPtr <= fpEndPtr);
+ Assert(fpPtr <= fpEndPtr);
+ }
+
+ Assert(fpPtr == endptr);
+ }
+
+ /* auxiliary processes / prepared xacts */
+ {
+ /* auxiliary processes and prepared xacts share a single extra chunk */
+ int procs_node = (NUM_AUXILIARY_PROCS + max_prepared_xacts);
+
+ /* align to memory page, to make move_pages possible */
+ fpPtr = (char *) TYPEALIGN(mem_page_size, fpPtr);
+
+ /* now point the PGPROC entries to the fast-path arrays */
+ for (j = 0; j < procs_node; j++)
+ {
+ PGPROC *proc = ProcGlobal->allProcs[procs_total++];
+
+ /* cross-check we got PGPROC with no NUMA node assigned */
+ Assert(proc->numa_node == -1);
+ Assert(proc->procnumber == (procs_total - 1));
+
+ /*
+ * Set the fast-path lock arrays, and move the pointer. We
+ * interleave the two arrays, to (hopefully) get some locality for
+ * each backend.
+ */
+ proc->fpLockBits = (uint64 *) fpPtr;
+ fpPtr += fpLockBitsSize;
+
+ proc->fpRelId = (Oid *) fpPtr;
+ fpPtr += fpRelIdSize;
+
+ Assert(fpPtr <= fpEndPtr);
+ }
+ }
+
+ /* Should not have consumed more than the allocated fast-path memory. */
+ Assert(fpPtr <= fpEndPtr);
+
+ /* make sure we allocated the expected number of PGPROC entries */
+ Assert(procs_total == TotalProcs);
+
+ for (i = 0; i < TotalProcs; i++)
+ {
+ PGPROC *proc = procs[i];
+
+ Assert(proc->procnumber == i);
/*
* Set up per-PGPROC semaphore, latch, and fpInfoLock. Prepared xact
@@ -366,15 +602,12 @@ InitProcGlobal(void)
pg_atomic_init_u64(&(proc->waitStart), 0);
}
- /* Should have consumed exactly the expected amount of fast-path memory. */
- Assert(fpPtr == fpEndPtr);
-
/*
* Save pointers to the blocks of PGPROC structures reserved for auxiliary
* processes and prepared transactions.
*/
- AuxiliaryProcs = &procs[MaxBackends];
- PreparedXactProcs = &procs[MaxBackends + NUM_AUXILIARY_PROCS];
+ AuxiliaryProcs = procs[MaxBackends];
+ PreparedXactProcs = procs[MaxBackends + NUM_AUXILIARY_PROCS];
/* Create ProcStructLock spinlock, too */
ProcStructLock = (slock_t *) ShmemInitStruct("ProcStructLock spinlock",
@@ -435,7 +668,45 @@ InitProcess(void)
if (!dlist_is_empty(procgloballist))
{
- MyProc = dlist_container(PGPROC, links, dlist_pop_head_node(procgloballist));
+ /*
+ * With numa interleaving of PGPROC, try to get a PROC entry from the
+ * right NUMA node (when the process starts).
+ *
+ * XXX The process may move to a different NUMA node later, but
+ * there's not much we can do about that.
+ */
+ if (numa_procs_interleave)
+ {
+ dlist_mutable_iter iter;
+ unsigned cpu;
+ unsigned node;
+ int rc;
+
+ rc = getcpu(&cpu, &node);
+ if (rc != 0)
+ elog(ERROR, "getcpu failed: %m");
+
+ MyProc = NULL;
+
+ dlist_foreach_modify(iter, procgloballist)
+ {
+ PGPROC *proc;
+
+ proc = dlist_container(PGPROC, links, iter.cur);
+
+ if (proc->numa_node == node)
+ {
+ MyProc = proc;
+ dlist_delete(iter.cur);
+ break;
+ }
+ }
+ }
+
+ /* didn't find PGPROC from the correct NUMA node, pick any free one */
+ if (MyProc == NULL)
+ MyProc = dlist_container(PGPROC, links, dlist_pop_head_node(procgloballist));
+
SpinLockRelease(ProcStructLock);
}
else
@@ -1988,7 +2259,7 @@ ProcSendSignal(ProcNumber procNumber)
if (procNumber < 0 || procNumber >= ProcGlobal->allProcCount)
elog(ERROR, "procNumber out of range");
- SetLatch(&ProcGlobal->allProcs[procNumber].procLatch);
+ SetLatch(&ProcGlobal->allProcs[procNumber]->procLatch);
}
/*
@@ -2063,3 +2334,60 @@ BecomeLockGroupMember(PGPROC *leader, int pid)
return ok;
}
+
+/* copy from buf_init.c */
+static Size
+get_memory_page_size(void)
+{
+ Size os_page_size;
+ Size huge_page_size;
+
+#ifdef WIN32
+ SYSTEM_INFO sysinfo;
+
+ GetSystemInfo(&sysinfo);
+ os_page_size = sysinfo.dwPageSize;
+#else
+ os_page_size = sysconf(_SC_PAGESIZE);
+#endif
+
+ /*
+ * XXX This is a bit annoying/confusing, because we may get a different
+ * result depending on when we call it. Before mmap() we don't know if the
+ * huge pages get used, so we assume they will. And then if we don't get
+ * huge pages, we'll waste memory etc.
+ */
+
+ /* assume huge pages get used, unless HUGE_PAGES_OFF */
+ if (huge_pages_status == HUGE_PAGES_OFF)
+ huge_page_size = 0;
+ else
+ GetHugePageSize(&huge_page_size, NULL);
+
+ return Max(os_page_size, huge_page_size);
+}
+
+/*
+ * move_to_node
+ * move all pages in the given range to the requested NUMA node
+ *
+ * XXX This is expected to only process a fairly small number of pages, so no
+ * need to do batching etc. Just move pages one by one.
+ */
+static void
+move_to_node(char *startptr, char *endptr, Size mem_page_size, int node)
+{
+ while (startptr < endptr)
+ {
+ int r,
+ status;
+
+ r = numa_move_pages(0, 1, (void **) &startptr, &node, &status, 0);
+
+ if (r != 0)
+ elog(WARNING, "failed to move page to NUMA node %d (r = %d, status = %d)",
+ node, r, status);
+
+ startptr += mem_page_size;
+ }
+}
diff --git a/src/backend/utils/init/globals.c b/src/backend/utils/init/globals.c
index 7febf3001a3..bf775c76545 100644
--- a/src/backend/utils/init/globals.c
+++ b/src/backend/utils/init/globals.c
@@ -149,6 +149,7 @@ int MaxBackends = 0;
bool numa_buffers_interleave = false;
bool numa_localalloc = false;
int numa_partition_freelist = 0;
+bool numa_procs_interleave = false;
/* GUC parameters for vacuum */
int VacuumBufferUsageLimit = 2048;
diff --git a/src/backend/utils/misc/guc_tables.c b/src/backend/utils/misc/guc_tables.c
index e2361c161e6..930082588f2 100644
--- a/src/backend/utils/misc/guc_tables.c
+++ b/src/backend/utils/misc/guc_tables.c
@@ -2144,6 +2144,16 @@ struct config_bool ConfigureNamesBool[] =
NULL, NULL, NULL
},
+ {
+ {"numa_procs_interleave", PGC_POSTMASTER, DEVELOPER_OPTIONS,
+ gettext_noop("Enables NUMA interleaving of PGPROC entries."),
+ gettext_noop("When enabled, the PGPROC entries are interleaved to all NUMA nodes."),
+ },
+ &numa_procs_interleave,
+ false,
+ NULL, NULL, NULL
+ },
+
{
{"sync_replication_slots", PGC_SIGHUP, REPLICATION_STANDBY,
gettext_noop("Enables a physical standby to synchronize logical failover replication slots from the primary server."),
diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h
index 17528439f07..f454b4e9d75 100644
--- a/src/include/miscadmin.h
+++ b/src/include/miscadmin.h
@@ -181,6 +181,7 @@ extern PGDLLIMPORT int max_parallel_workers;
extern PGDLLIMPORT bool numa_buffers_interleave;
extern PGDLLIMPORT bool numa_localalloc;
extern PGDLLIMPORT int numa_partition_freelist;
+extern PGDLLIMPORT bool numa_procs_interleave;
extern PGDLLIMPORT int commit_timestamp_buffers;
extern PGDLLIMPORT int multixact_member_buffers;
diff --git a/src/include/storage/proc.h b/src/include/storage/proc.h
index 9f9b3fcfbf1..5cb1632718e 100644
--- a/src/include/storage/proc.h
+++ b/src/include/storage/proc.h
@@ -194,6 +194,8 @@ struct PGPROC
* vacuum must not remove tuples deleted by
* xid >= xmin ! */
+ int procnumber; /* index in ProcGlobal->allProcs */
+
int pid; /* Backend's process ID; 0 if prepared xact */
int pgxactoff; /* offset into various ProcGlobal->arrays with
@@ -319,6 +321,9 @@ struct PGPROC
PGPROC *lockGroupLeader; /* lock group leader, if I'm a member */
dlist_head lockGroupMembers; /* list of members, if I'm a leader */
dlist_node lockGroupLink; /* my member link, if I'm a member */
+
+ /* NUMA node */
+ int numa_node;
};
/* NOTE: "typedef struct PGPROC PGPROC" appears in storage/lock.h. */
@@ -383,7 +388,7 @@ extern PGDLLIMPORT PGPROC *MyProc;
typedef struct PROC_HDR
{
/* Array of PGPROC structures (not including dummies for prepared txns) */
- PGPROC *allProcs;
+ PGPROC **allProcs;
/* Array mirroring PGPROC.xid for each PGPROC currently in the procarray */
TransactionId *xids;
@@ -435,8 +440,8 @@ extern PGDLLIMPORT PGPROC *PreparedXactProcs;
/*
* Accessors for getting PGPROC given a ProcNumber and vice versa.
*/
-#define GetPGProcByNumber(n) (&ProcGlobal->allProcs[(n)])
-#define GetNumberFromPGProc(proc) ((proc) - &ProcGlobal->allProcs[0])
+#define GetPGProcByNumber(n) (ProcGlobal->allProcs[(n)])
+#define GetNumberFromPGProc(proc) ((proc)->procnumber)
/*
* We set aside some extra PGPROC structures for "special worker" processes,
--
2.49.0
v1-0006-NUMA-pin-backends-to-NUMA-nodes.patch
From f76377a56f37421c61c4dd876813b57084b019df Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tomas@vondra.me>
Date: Tue, 27 May 2025 23:08:48 +0200
Subject: [PATCH v1 6/6] NUMA: pin backends to NUMA nodes
When initializing the backend, we pick a PGPROC entry from the right
NUMA node where the backend is running. But the process can move to a
different core / node, so to prevent that we pin it.
---
src/backend/storage/lmgr/proc.c | 21 +++++++++++++++++++++
src/backend/utils/init/globals.c | 1 +
src/backend/utils/misc/guc_tables.c | 10 ++++++++++
src/include/miscadmin.h | 1 +
4 files changed, 33 insertions(+)
diff --git a/src/backend/storage/lmgr/proc.c b/src/backend/storage/lmgr/proc.c
index 9d3e94a7b3a..4c9e55608b2 100644
--- a/src/backend/storage/lmgr/proc.c
+++ b/src/backend/storage/lmgr/proc.c
@@ -729,6 +729,27 @@ InitProcess(void)
}
MyProcNumber = GetNumberFromPGProc(MyProc);
+ /*
+ * Optionally, restrict the process to only run on CPUs from the same NUMA
+ * node as the PGPROC. We do this even if the PGPROC came from a different
+ * node than the one the process currently runs on, but not for PGPROC
+ * entries without a node (i.e. aux/2PC entries).
+ *
+ * This also means we only do this with numa_procs_interleave, because
+ * without that we'll have numa_node=-1 for all PGPROC entries.
+ *
+ * FIXME add proper error-checking for libnuma functions
+ */
+ if (numa_procs_pin && MyProc->numa_node != -1)
+ {
+ struct bitmask *cpumask = numa_allocate_cpumask();
+
+ numa_node_to_cpus(MyProc->numa_node, cpumask);
+
+ numa_sched_setaffinity(MyProcPid, cpumask);
+
+ numa_free_cpumask(cpumask);
+ }
+
/*
* Cross-check that the PGPROC is of the type we expect; if this were not
* the case, it would get returned to the wrong list.
diff --git a/src/backend/utils/init/globals.c b/src/backend/utils/init/globals.c
index bf775c76545..e584ba840ef 100644
--- a/src/backend/utils/init/globals.c
+++ b/src/backend/utils/init/globals.c
@@ -150,6 +150,7 @@ bool numa_buffers_interleave = false;
bool numa_localalloc = false;
int numa_partition_freelist = 0;
bool numa_procs_interleave = false;
+bool numa_procs_pin = false;
/* GUC parameters for vacuum */
int VacuumBufferUsageLimit = 2048;
diff --git a/src/backend/utils/misc/guc_tables.c b/src/backend/utils/misc/guc_tables.c
index 930082588f2..3fc8897ae36 100644
--- a/src/backend/utils/misc/guc_tables.c
+++ b/src/backend/utils/misc/guc_tables.c
@@ -2154,6 +2154,16 @@ struct config_bool ConfigureNamesBool[] =
NULL, NULL, NULL
},
+ {
+ {"numa_procs_pin", PGC_POSTMASTER, DEVELOPER_OPTIONS,
+ gettext_noop("Enables pinning backends to NUMA nodes (matching the PGPROC node)."),
+ gettext_noop("When enabled, sets affinity to CPUs from the same NUMA node."),
+ },
+ &numa_procs_pin,
+ false,
+ NULL, NULL, NULL
+ },
+
{
{"sync_replication_slots", PGC_SIGHUP, REPLICATION_STANDBY,
gettext_noop("Enables a physical standby to synchronize logical failover replication slots from the primary server."),
diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h
index f454b4e9d75..d0d960caa9d 100644
--- a/src/include/miscadmin.h
+++ b/src/include/miscadmin.h
@@ -182,6 +182,7 @@ extern PGDLLIMPORT bool numa_buffers_interleave;
extern PGDLLIMPORT bool numa_localalloc;
extern PGDLLIMPORT int numa_partition_freelist;
extern PGDLLIMPORT bool numa_procs_interleave;
+extern PGDLLIMPORT bool numa_procs_pin;
extern PGDLLIMPORT int commit_timestamp_buffers;
extern PGDLLIMPORT int multixact_member_buffers;
--
2.49.0
On Wed, Jul 2, 2025 at 12:37 AM Tomas Vondra <tomas@vondra.me> wrote:
3) v1-0003-freelist-Don-t-track-tail-of-a-freelist.patch
Minor optimization. Andres noticed we're tracking the tail of buffer
freelist, without using it. So the patch removes that.
The patches for resizing buffers use the lastFreeBuffer to add new
buffers to the end of free list when expanding it. But we could as
well add it at the beginning of the free list.
This patch seems almost independent of the rest of the patches. Do you
need it in the rest of the patches? I understand that those patches
don't need to worry about maintaining lastFreeBuffer after this patch.
Is there any other effect?
If we are going to do this, let's do it earlier so that buffer
resizing patches can be adjusted.
There's also the question how this is related to other patches affecting
shared memory - I think the most relevant one is the "shared buffers
online resize" by Ashutosh, simply because it touches the shared memory.
I have added Dmitry to this thread since he has written most of the
shared memory handling code.
I don't think the splitting would actually make some things simpler, or
maybe more flexible - in particular, it'd allow us to enable huge pages
only for some regions (like shared buffers), and keep the small pages
e.g. for PGPROC. So that'd be good.
The resizing patches split the shared buffer related structures into
separate memory segments. I think that itself will help enabling huge
pages for some regions. Would that help in your case?
But there'd also need to be some logic to "rework" how shared buffers
get mapped to NUMA nodes after resizing. It'd be silly to start with
memory on 4 nodes (25% each), resize shared buffers to 50% and end up
with memory only on 2 of the nodes (because the other 2 nodes were
originally assigned the upper half of shared buffers).
I don't have a clear idea how this would be done, but I guess it'd
require a bit of code invoked sometime after the resize. It'd already
need to rebuild the freelists in some way, I guess.
Yes, there's code to build the free list. I think we will need code to
remap the buffers and buffer descriptor.
--
Best Wishes,
Ashutosh Bapat
On 7/2/25 13:37, Ashutosh Bapat wrote:
On Wed, Jul 2, 2025 at 12:37 AM Tomas Vondra <tomas@vondra.me> wrote:
3) v1-0003-freelist-Don-t-track-tail-of-a-freelist.patch
Minor optimization. Andres noticed we're tracking the tail of buffer
freelist, without using it. So the patch removes that.
The patches for resizing buffers use the lastFreeBuffer to add new
buffers to the end of free list when expanding it. But we could as
well add it at the beginning of the free list.
This patch seems almost independent of the rest of the patches. Do you
need it in the rest of the patches? I understand that those patches
don't need to worry about maintaining lastFreeBuffer after this patch.
Is there any other effect?
If we are going to do this, let's do it earlier so that buffer
resizing patches can be adjusted.
My patches don't particularly rely on this bit, it would work even with
lastFreeBuffer. I believe Andres simply noticed the current code does
not use lastFreeBuffer, it just maintains it, so he removed that as an
optimization. I don't know how significant is the improvement, but if
it's measurable we could just do that independently of our patches.
There's also the question how this is related to other patches affecting
shared memory - I think the most relevant one is the "shared buffers
online resize" by Ashutosh, simply because it touches the shared memory.I have added Dmitry to this thread since he has written most of the
shared memory handling code.
Thanks.
I don't think the splitting would actually make some things simpler, or
maybe more flexible - in particular, it'd allow us to enable huge pages
only for some regions (like shared buffers), and keep the small pages
e.g. for PGPROC. So that'd be good.
The resizing patches split the shared buffer related structures into
separate memory segments. I think that itself will help enabling huge
pages for some regions. Would that help in your case?
Indirectly. My patch can work just fine with a single segment, but being
able to enable huge pages only for some of the segments seems better.
But there'd also need to be some logic to "rework" how shared buffers
get mapped to NUMA nodes after resizing. It'd be silly to start with
memory on 4 nodes (25% each), resize shared buffers to 50% and end up
with memory only on 2 of the nodes (because the other 2 nodes were
originally assigned the upper half of shared buffers).
I don't have a clear idea how this would be done, but I guess it'd
require a bit of code invoked sometime after the resize. It'd already
need to rebuild the freelists in some way, I guess.
Yes, there's code to build the free list. I think we will need code to
remap the buffers and buffer descriptor.
Right. The good thing is that's just "advisory" information, it doesn't
break anything if it's temporarily out of sync. We don't need to "stop"
everything to remap the buffers to other nodes, or anything like that.
Or at least I think so.
It's one thing to "flip" the target mapping (determining which node a
buffer should be on), and actually migrating the buffers. The first part
can be done instantaneously, the second part can happen in the
background over a longer time period.
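Just to illustrate what I mean, a minimal sketch of such a background pass,
assuming the buffer-to-node mapping from 0001 has already been "flipped";
mem_page_size and the warning text are illustrative, numa_move_pages() is the
same libnuma call the PGPROC patch already uses:

    /* migrate shared buffers to their (new) target nodes, one page at a time */
    Size  npages = ((Size) NBuffers * BLCKSZ) / mem_page_size;

    for (Size p = 0; p < npages; p++)
    {
        void   *page = BufferBlocks + p * mem_page_size;
        int     node = BufferGetNode((p * mem_page_size) / BLCKSZ);
        int     status;

        if (numa_move_pages(0, 1, &page, &node, &status, 0) != 0)
            elog(WARNING, "failed to migrate shared buffers page %zu", (size_t) p);

        CHECK_FOR_INTERRUPTS();     /* can be spread over a long period */
    }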
I'm not sure how you're rebuilding the freelist. Presumably it can
contain buffers that are no longer valid (after shrinking). How is that
handled to not break anything? I think the NUMA variant would do exactly
the same thing, except that there's multiple lists.
regards
--
Tomas Vondra
On Wed, Jul 2, 2025 at 6:06 PM Tomas Vondra <tomas@vondra.me> wrote:
I'm not sure how you're rebuilding the freelist. Presumably it can
contain buffers that are no longer valid (after shrinking). How is that
handled to not break anything? I think the NUMA variant would do exactly
the same thing, except that there's multiple lists.
Before shrinking the buffers, we walk the free list removing any
buffers that are going to be removed. When expanding, by linking the
new buffers in the order and then adding those to the already existing
free list. 0005 patch in [1]/messages/by-id/my4hukmejato53ef465ev7lk3sqiqvneh7436rz64wmtc7rbfj@hmuxsf2ngov2 has the code for the same.
[1]: /messages/by-id/my4hukmejato53ef465ev7lk3sqiqvneh7436rz64wmtc7rbfj@hmuxsf2ngov2
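Roughly, the pruning walk looks like the sketch below (illustrative only,
assuming the single pre-NUMA freelist and a new, smaller buffer count
NBuffersNew; the real code is in the patch referenced above):

    /* unlink buffers that fall beyond the new (smaller) pool size */
    int *prev = &StrategyControl->firstFreeBuffer;

    while (*prev >= 0)
    {
        BufferDesc *buf = GetBufferDescriptor(*prev);

        if (*prev >= NBuffersNew)
            *prev = buf->freeNext;      /* drop this buffer from the list */
        else
            prev = &buf->freeNext;      /* keep it, move on */
    }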
--
Best Wishes,
Ashutosh Bapat
On Wed, Jul 02, 2025 at 05:07:28PM +0530, Ashutosh Bapat wrote:
There's also the question how this is related to other patches affecting
shared memory - I think the most relevant one is the "shared buffers
online resize" by Ashutosh, simply because it touches the shared memory.I have added Dmitry to this thread since he has written most of the
shared memory handling code.
Thanks! I like the idea behind this patch series. I haven't read it in
details yet, but I can imagine both patches (interleaving and online
resizing) could benefit from each other. In online resizing we've
introduced a possibility to use multiple shared mappings for different
types of data, maybe it would be convenient to use the same interface to
create separate mappings for different NUMA nodes as well. Using a
separate shared mapping per NUMA node would also make resizing easier,
since it would be more straightforward to fit an increased segment into
NUMA boundaries.
I don't think the splitting would actually make some things simpler, or
maybe more flexible - in particular, it'd allow us to enable huge pages
only for some regions (like shared buffers), and keep the small pages
e.g. for PGPROC. So that'd be good.
The resizing patches split the shared buffer related structures into
separate memory segments. I think that itself will help enabling huge
pages for some regions. Would that help in your case?
Right, separate segments would allow to mix and match huge pages with
pages of regular size. It's not implemented in the latest version of
online resizing patch, purely to reduce complexity and maintain the same
invariant (everything is either using huge pages or not) -- but we could
do it other way around as well.
On Tue, Jul 1, 2025 at 9:07 PM Tomas Vondra <tomas@vondra.me> wrote:
Hi!
1) v1-0001-NUMA-interleaving-buffers.patch
[..]
It's a bit more complicated, because the patch distributes both the
blocks and descriptors, in the same way. So a buffer and it's descriptor
always end on the same NUMA node. This is one of the reasons why we need
to map larger chunks, because NUMA works on page granularity, and the
descriptors are tiny - many fit on a memory page.
Oh, now I get it! OK, let's stick to this one.
I don't think the splitting would actually make some things simpler, or
maybe more flexible - in particular, it'd allow us to enable huge pages
only for some regions (like shared buffers), and keep the small pages
e.g. for PGPROC. So that'd be good.
You have made the assumption that this is good, but small pages (4KB) are
not hugetlb, and are *swappable* (Transparent HP are swappable too,
manually allocated ones as with mmap(MAP_HUGETLB) are not)[1]https://www.kernel.org/doc/Documentation/vm/hugetlbpage.txt. The
most frequent problem I see these days is OOMs, and it makes me
believe that making certain critical parts of shared memory
swappable just to get page-size granularity is possibly throwing the baby
out with the bathwater. I'm thinking about bad situations like: some
wrong settings of vm.swappiness that people keep (or distros keep?) and
the general inability of PG to restrain from allocating more memory in
some cases.
The other thing I haven't thought about very much is determining on
which CPUs/nodes the instance is allowed to run. I assume we'd start by
simply inherit/determine that at the start through libnuma, not through
some custom PG configuration (which the patch [2] proposed to do).
0. I think that we could do better, some counter arguments to
no-configuration-at-all:
a. as Robert & Bertrand already put it there after review: let's say I
want just to run on NUMA #2 node, so here I would need to override
systemd's script ExecStart= to include that numactl (not elegant?). I
could also use `CPUAffinity=1,3,5,7..` but that's all, and it is even
less friendly. Also it probably requires root to edit/reload systemd,
while having GUC for this like in my proposal makes it more smooth (I
think?)
b. wouldn't it be better if that stayed as a drop-in rather than always
on? What if there's a problem - how do you disable those internal
optimizations if they do harm in some cases (or let's say I want to
play with MPOL_INTERLEAVE_WEIGHTED)? So at least a boolean
numa_buffers_interleave would be nice?
c. What if I want my standby (walreceiver+startup/recovery) to run
with NUMA affinity to get better performance? (I'm not going to hack
around the systemd script every time, but I could imagine changing
numa=X,Y,Z after restart/before promotion.)
d. Now if I were forced for some reason to do that numactl(1)
voodoo, and use those above-mentioned overrides because PG wouldn't
have a GUC (let's say I would use `numactl
--weighted-interleave=0,1`), then:
2) v1-0002-NUMA-localalloc.patch
This simply sets "localalloc" when initializing a backend, so that all
memory allocated later is local, not interleaved. Initially this was
necessary because the patch set the allocation policy to interleaving
before initializing shared memory, and we didn't want to interleave the
private memory. But that's no longer the case - the explicit mapping to
nodes does not have this issue. I'm keeping the patch for convenience,
it allows experimenting with numactl etc.
... is not accurate anymore, so we would still need to have that in
(still with a GUC)?
Thoughts? I can add my part to your patches if you want.
A way too quick review and some very fast benchmark probes; I've
concentrated only on v1-0001 and v1-0005 (the efficiency of buffermgmt
would be too new a topic for me), but let's start:
1. normal pgbench -S (still with just s_b@4GB), done many tries,
consistent benefit for the patch with like +8..10% boost on generic
run:
numa_buffers_interleave=off numa_pgproc_interleave=on (due to that
always-on "if"), s_b just on 1 NUMA node (might happen)
latency average = 0.373 ms
latency stddev = 0.237 ms
initial connection time = 45.899 ms
tps = 160242.147877 (without initial connection time)
numa_buffers_interleave=on numa_pgproc_interleave=on
latency average = 0.345 ms
latency stddev = 0.373 ms
initial connection time = 44.485 ms
tps = 177564.686094 (without initial connection time)
2. Tested it the same way as I did for mine (problem #2 from Andres's
presentation): 4s32c128t, s_b=4GB (on 128GB), prewarm test (with
seqconcurrscans.pgb as earlier)
default/numa_buffers_interleave=off
latency average = 1375.478 ms
latency stddev = 1141.423 ms
initial connection time = 46.104 ms
tps = 45.868075 (without initial connection time)
numa_buffers_interleave=on
latency average = 838.128 ms
latency stddev = 498.787 ms
initial connection time = 43.437 ms
tps = 75.413894 (without initial connection time)
and I've repeated the same test (identical conditions) with my
patch, which got me slightly more juice:
latency average = 727.717 ms
latency stddev = 410.767 ms
initial connection time = 45.119 ms
tps = 86.844161 (without initial connection time)
(but mine didn't get that boost from normal pgbench as per #1
pgbench -S -- my numa='all' stays @ 160k TPS just as
numa_buffers_interleave=off), so this idea is clearly better.
So should I close https://commitfest.postgresql.org/patch/5703/
and you'll open a new one or should I just edit the #5703 and alter it
and add this thread too?
3. The patch is not calling interleave on PQ shmem; do we want to add that
in as some next item, like v1-0007? The question is whether OS interleaving
makes sense there. I believe it does - please see my thread
(NUMA_pq_cpu_pinning_results.txt); the issue is that PQ workers are
spawned by the postmaster and may end up on different NUMA nodes
randomly, so OS-interleaving that memory actually reduces jitter there
(AKA bandwidth-over-latency). My thinking is that one cannot expect a
static/forced CPU-to-just-one-NUMA-node assignment for a backend and
its PQ workers, because it is impossible to always have CPU
power available on that NUMA node, so it might be useful to interleave
that shared mem there too (as a separate patch item? see the sketch below)
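Conceptually it would be as small as this sketch, run right after the
parallel-query DSM segment is created (the helper name is made up;
numa_interleave_memory() and numa_all_nodes_ptr are standard libnuma,
dsm_segment_address() is the existing DSM API):

    #ifdef USE_LIBNUMA
    /* spread a freshly created parallel-query DSM segment across all nodes */
    static void
    interleave_pq_segment(dsm_segment *seg, Size size)
    {
        void   *addr = dsm_segment_address(seg);

        numa_interleave_memory(addr, size, numa_all_nodes_ptr);
    }
    #endif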
4. In BufferManagerShmemInit() you call numa_num_configured_nodes()
(also in v1-0005). My worry is: should we put some
known-limitations docs (?) in from the start and mention that
if the VM is greatly resized and new NUMA nodes appear, they might
not be used until restart?
5. In v1-0001, pg_numa_interleave_memory()
+ * XXX no return value, to make this fail on error, has to use
+ * numa_set_strict
Yes, my patch has those numa_error() and numa_warn() handlers too in
pg_numa. Feel free to use them for better UX.
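For reference, libnuma lets the application override its numa_warn() /
numa_error() hooks; a minimal sketch of routing them through elog (not
necessarily how the pg_numa patch does it):

    #include <stdarg.h>

    /* libnuma calls these hooks instead of printing to stderr / exiting */
    void
    numa_warn(int num, char *fmt,...)
    {
        char        buf[1024];
        va_list     ap;

        va_start(ap, fmt);
        vsnprintf(buf, sizeof(buf), fmt, ap);
        va_end(ap);

        elog(WARNING, "libnuma warning (%d): %s", num, buf);
    }

    void
    numa_error(char *where)
    {
        /* errno is set by the failed libnuma call */
        elog(WARNING, "libnuma error in %s: %m", where);
    }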
+ * XXX Should we still touch the memory first, like
with numa_move_pages,
+ * or is that not necessary?
It's not necessary to touch the memory after numa_tonode_memory() (wrapper
around numa_interleave_memory()); if it is going to be used anyway it will
be correctly placed, to the best of my knowledge.
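For reference, a minimal sketch of what such handlers could look like -
libnuma lets a program override numa_error() and numa_warn(), and the
elog levels and wording below are just placeholders, not taken from my
patch:

#include "postgres.h"

#include <stdarg.h>
#include <numa.h>

/* Route libnuma warnings through ereport() instead of stderr. */
void
numa_warn(int num, char *fmt,...)
{
    char        buf[1024];
    va_list     ap;

    va_start(ap, fmt);
    vsnprintf(buf, sizeof(buf), fmt, ap);
    va_end(ap);

    ereport(WARNING,
            (errmsg_internal("libnuma (%d): %s", num, buf)));
}

/* libnuma calls this on failures; errno is set by the failing call. */
void
numa_error(char *where)
{
    ereport(ERROR,
            (errmsg_internal("libnuma: %s failed: %m", where)));
}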
6. diff --git a/src/bin/pgbench/pgbench.c b/src/bin/pgbench/pgbench.c
Accidental indents (also fails to apply)
7. We're missing the pg_numa_* shims, but that's surely for later - and it
would also let us avoid the Linux-specific #ifdef USE_LIBNUMA and so on?
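Something along these lines, perhaps (just a hypothetical sketch - the
function name is mine, not from the patch set; the point is that callers
could use the pg_numa_* API unconditionally and a non-libnuma build would
simply behave like a single-node system):

#ifdef USE_LIBNUMA
#include <numa.h>

int
pg_numa_nodes(void)
{
    /* treat "libnuma unavailable" the same as a single node */
    return (numa_available() < 0) ? 1 : numa_num_configured_nodes();
}
#else
int
pg_numa_nodes(void)
{
    return 1;                   /* no NUMA support built in */
}
#endif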
8. v1-0005 has 2x "+ /* if (numa_procs_interleave) */"
Ha! It's a trap! I've uncommented it because I wanted to try it out
without it (just by setting the GUC off), but "MyProc->sem" is NULL:
2025-07-04 12:31:08.103 CEST [28754] LOG: starting PostgreSQL
19devel on x86_64-linux, compiled by gcc-12.2.0, 64-bit
[..]
2025-07-04 12:31:08.109 CEST [28754] LOG: io worker (PID 28755)
was terminated by signal 11: Segmentation fault
2025-07-04 12:31:08.109 CEST [28754] LOG: terminating any other
active server processes
2025-07-04 12:31:08.114 CEST [28754] LOG: shutting down because
"restart_after_crash" is off
2025-07-04 12:31:08.116 CEST [28754] LOG: database system is shut down
[New LWP 28755]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
Core was generated by `postgres: io worker '.
Program terminated with signal SIGSEGV, Segmentation fault.
#0 __new_sem_wait_fast (definitive_result=1, sem=sem@entry=0x0)
at ./nptl/sem_waitcommon.c:136
136 ./nptl/sem_waitcommon.c: No such file or directory.
(gdb) where
#0 __new_sem_wait_fast (definitive_result=1, sem=sem@entry=0x0)
at ./nptl/sem_waitcommon.c:136
#1 __new_sem_trywait (sem=sem@entry=0x0) at ./nptl/sem_wait.c:81
#2 0x00005561918e0cac in PGSemaphoreReset (sema=0x0) at
../src/backend/port/posix_sema.c:302
#3 0x0000556191970553 in InitAuxiliaryProcess () at
../src/backend/storage/lmgr/proc.c:992
#4 0x00005561918e51a2 in AuxiliaryProcessMainCommon () at
../src/backend/postmaster/auxprocess.c:65
#5 0x0000556191940676 in IoWorkerMain (startup_data=<optimized
out>, startup_data_len=<optimized out>) at
../src/backend/storage/aio/method_worker.c:393
#6 0x00005561918e8163 in postmaster_child_launch
(child_type=child_type@entry=B_IO_WORKER, child_slot=20086,
startup_data=startup_data@entry=0x0,
startup_data_len=startup_data_len@entry=0,
client_sock=client_sock@entry=0x0) at
../src/backend/postmaster/launch_backend.c:290
#7 0x00005561918ea09a in StartChildProcess
(type=type@entry=B_IO_WORKER) at
../src/backend/postmaster/postmaster.c:3973
#8 0x00005561918ea308 in maybe_adjust_io_workers () at
../src/backend/postmaster/postmaster.c:4404
[..]
(gdb) print *MyProc->sem
Cannot access memory at address 0x0
9. v1-0006: is this just a thought or a serious candidate? I can imagine
it could easily blow up, with some backends somehow requesting CPUs only
from one NUMA node while the second node sits idle. Isn't it better
just to leave CPU scheduling, well, to the CPU scheduler? The problem
is that you have tools showing overall CPU usage, even mpstat(1) per
CPU, but no tools for per-NUMA-node CPU util%, so it would be hard
for someone to realize that this is happening.
-J.
[1]: https://www.kernel.org/doc/Documentation/vm/hugetlbpage.txt
On 7/4/25 13:05, Jakub Wartak wrote:
On Tue, Jul 1, 2025 at 9:07 PM Tomas Vondra <tomas@vondra.me> wrote:
Hi!
1) v1-0001-NUMA-interleaving-buffers.patch
[..]
It's a bit more complicated, because the patch distributes both the
blocks and descriptors, in the same way. So a buffer and it's descriptor
always end on the same NUMA node. This is one of the reasons why we need
to map larger chunks, because NUMA works on page granularity, and the
descriptors are tiny - many fit on a memory page.
Oh, now I get it! OK, let's stick to this one.
I don't think the splitting would actually make some things simpler, or
maybe more flexible - in particular, it'd allow us to enable huge pages
only for some regions (like shared buffers), and keep the small pages
e.g. for PGPROC. So that'd be good.
You have made assumption that this is good, but small pages (4KB) are
not hugetlb, and are *swappable* (Transparent HP are swappable too,
manually allocated ones as with mmap(MMAP_HUGETLB) are not)[1]. The
most frequent problem I see these days are OOMs, and it makes me
believe that making certain critical parts of shared memory being
swappable just to make pagesize granular is possibly throwing the baby
out with the bathwater. I'm thinking about bad situations like: some
wrong settings of vm.swapiness that people keep (or distros keep?) and
general inability of PG to restrain from allocating more memory in
some cases.
I haven't observed such issues myself, or maybe I didn't realize it's
happening. Maybe it happens, but it'd be good to see some data showing
that, or a reproducer of some sort. But let's say it's real.
I don't think we should use huge pages merely to ensure something is not
swapped out. The "not swappable" is more of a limitation of huge pages,
not an advantage. You can't just choose to make them swappable.
Wouldn't it be better to keep using 4KB pages, but lock the memory using
mlock/mlockall?
The other thing I haven't thought about very much is determining on
which CPUs/nodes the instance is allowed to run. I assume we'd start by
simply inherit/determine that at the start through libnuma, not through
some custom PG configuration (which the patch [2] proposed to do).
0. I think that we could do better, some counter arguments to
no-configuration-at-all:
a. as Robert & Bertrand already put it there after review: let's say I
want just to run on NUMA #2 node, so here I would need to override
systemd's script ExecStart= to include that numactl (not elegant?). I
could also use `CPUAffinity=1,3,5,7..` but that's all, and it is even
less friendly. Also it probably requires root to edit/reload systemd,
while having GUC for this like in my proposal makes it more smooth (I
think?)
b. wouldn't it be better if that stayed as drop-in rather than always
on? What if there's a problem, how do you disable those internal
optimizations if they do harm in some cases? (or let's say I want to
play with MPOL_INTERLEAVE_WEIGHTED?). So at least boolean
numa_buffers_interleave would be nice?
c. What if I want my standby (walreceiver+startup/recovery) to run
with NUMA affinity to get better performance (I'm not going to hack
around systemd script every time, but I could imagine changing
numa=X,Y,Z after restart/before promotion)
d. Now if I would be forced for some reason to do that numactl(1)
voodoo, and use the those above mentioned overrides and PG wouldn't be
having GUC (let's say I would use `numactl
--weighted-interleave=0,1`), then:
I'm not against doing something like this, but I don't plan to do that
in V1. I don't have a clear idea what configurability is actually
needed, so it's likely I'd do the interface wrong.
2) v1-0002-NUMA-localalloc.patch
This simply sets "localalloc" when initializing a backend, so that all
memory allocated later is local, not interleaved. Initially this was
necessary because the patch set the allocation policy to interleaving
before initializing shared memory, and we didn't want to interleave the
private memory. But that's no longer the case - the explicit mapping to
nodes does not have this issue. I'm keeping the patch for convenience,
it allows experimenting with numactl etc.
... is not accurate anymore and we would require to have that in
(still with GUC)?
Thoughts? I can add that mine part into Your's patches if you want.
I'm sorry, I don't understand what the question is :-(
Way too quick review and some very fast benchmark probes, I've
concentrated only on v1-0001 and v1-0005 (efficiency of buffermgmt
would be too new topic for me), but let's start:
1. normal pgbench -S (still with just s_b@4GB), done many tries,
consistent benefit for the patch with like +8..10% boost on generic
run:
numa_buffers_interleave=off numa_pgproc_interleave=on(due that
always on "if"), s_b just on 1 NUMA node (might happen)
latency average = 0.373 ms
latency stddev = 0.237 ms
initial connection time = 45.899 ms
tps = 160242.147877 (without initial connection time)
numa_buffers_interleave=on numa_pgproc_interleave=on
latency average = 0.345 ms
latency stddev = 0.373 ms
initial connection time = 44.485 ms
tps = 177564.686094 (without initial connection time)
2. Tested it the same way as I did for mine(problem#2 from Andres's
presentation): 4s32c128t, s_b=4GB (on 128GB), prewarm test (with
seqconcurrscans.pgb as earlier)
default/numa_buffers_interleave=off
latency average = 1375.478 ms
latency stddev = 1141.423 ms
initial connection time = 46.104 ms
tps = 45.868075 (without initial connection time)
numa_buffers_interleave=on
latency average = 838.128 ms
latency stddev = 498.787 ms
initial connection time = 43.437 ms
tps = 75.413894 (without initial connection time)
and I've repeated the same test (identical conditions) with my
patch, got me slightly more juice:
latency average = 727.717 ms
latency stddev = 410.767 ms
initial connection time = 45.119 ms
tps = 86.844161 (without initial connection time)
(but mine didn't get that boost from normal pgbench as per #1
pgbench -S -- my numa='all' stays @ 160k TPS just as
numa_buffers_interleave=off), so this idea is clearly better.
Good, thanks for the testing. I should have done something like this
when I posted my patches, but I forgot about that (and the email felt
too long anyway).
But this actually brings up an interesting question. What exactly should we
expect / demand from these patches? In my mind it's primarily about
predictability and stability of results.
For example, the results should not depend on how the database was
warmed up - was it done by a single backend or many backends? Was it
restarted, or what? I could probably warm up the system very carefully to
ensure it's balanced. The patches mean I don't need to be that careful.
So should I close https://commitfest.postgresql.org/patch/5703/
and you'll open a new one or should I just edit the #5703 and alter it
and add this thread too?
Good question. It's probably best to close the original entry as
"withdrawn" and I'll add a new entry. Sounds OK?
3. Patch is not calling interleave on PQ shmem, do we want to add that
in as some next item like v1-0007? Question is whether OS interleaving
makes sense there ? I believe it does there, please see my thread
(NUMA_pq_cpu_pinning_results.txt), the issue is that PQ workers are
being spawned by postmaster and may end up on different NUMA nodes
randomly, so actually OS-interleaving that memory reduces jitter there
(AKA bandwidth-over-latency). My thinking is that one cannot expect
static/forced CPU-to-just-one-NUMA-node assignment for backend and
it's PQ workers, because it is impossible have always available CPU
power there in that NUMA node, so it might be useful to interleave
that shared mem there too (as separate patch item?)
Excellent question. I haven't thought about this at all. I agree it
probably makes sense to interleave this memory, in some way. I don't
know what the perfect scheme is, though.
wild idea: Would it make sense to pin the workers to the same NUMA node
as the leader? And allocate all memory only from that node?
4 In BufferManagerShmemInit() you call numa_num_configured_nodes()
(also in v1-0005). My worry is should we may put some
known-limitations docs (?) from start and mention that
if the VM is greatly resized and NUMA numa nodes appear, they might
not be used until restart?
Yes, this is one thing I need some feedback on. The patches mostly
assume there are no disabled nodes, that the set of allowed nodes does
not change, etc. I think for V1 that's a reasonable limitation.
But let's say we want to relax this a bit. How do we learn about the
change, after a node/CPU gets disabled? For some parts it's not that
difficult (e.g. we can "remap" buffers/descriptors in the background).
But for other parts that's not practical. E.g. we can't rework how the
PGPROC gets split.
But while discussing this with Andres yesterday, he had an interesting
suggestion - to always use e.g. 8 or 16 partitions, then partition this
by NUMA node. So we'd have 16 partitions, and with 4 nodes the 0-3 would
go to node 0, 4-7 would go to node 1, etc. The advantage is that if a
node gets disabled, we can rebuild just this small "mapping" and not the
16 partitions. And the partitioning may be helpful even without NUMA.
Still have to figure out the details, but seems it might help.
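Just to make that mapping concrete, a toy sketch of the idea (the constant
and the names are mine, purely illustrative):

#define NUM_PARTITIONS  16

static int  partition_node[NUM_PARTITIONS];

/*
 * Map a fixed number of partitions onto however many NUMA nodes exist
 * right now. If the set of nodes changes, only this small map needs to
 * be rebuilt - the partitioning itself stays put.
 */
static void
rebuild_partition_map(int num_nodes)
{
    for (int i = 0; i < NUM_PARTITIONS; i++)
        partition_node[i] = (i * num_nodes) / NUM_PARTITIONS;
}

With 4 nodes that gives exactly the 0-3 -> node 0, 4-7 -> node 1 split
described above, and it still spreads partitions roughly evenly when the
node count doesn't divide 16.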
5. In v1-0001, pg_numa_interleave_memory()
+ * XXX no return value, to make this fail on error, has to use
+ * numa_set_strict
Yes, my patch has those numa_error() and numa_warn() handlers too in
pg_numa. Feel free to use it for better UX.
+ * XXX Should we still touch the memory first, like with numa_move_pages,
+ * or is that not necessary?
It's not necessary to touch after numa_tonode_memory() (wrapper around
numa_interleave_memory()), if it is going to be used anyway it will be
correctly placed to best of my knowledge.
6. diff --git a/src/bin/pgbench/pgbench.c b/src/bin/pgbench/pgbench.c
Accidental indents (also fails to apply)
7. We miss the pg_numa_* shims, but for sure that's for later and also
avoid those Linux specific #ifdef USE_LIBNUMA and so on?
Right, we need to add those. Or actually, we need to think about how
we'd do this for non-NUMA systems. I wonder if we even want to just
build everything the "old way" (without the partitions, etc.).
But per the earlier comment, the partitioning seems beneficial even on
non-NUMA systems, so maybe the shims are good enough.
8. v1-0005 2x + /* if (numa_procs_interleave) */
Ha! it's a TRAP! I've uncommented it because I wanted to try it out
without it (just by setting GUC off) , but "MyProc->sema" is NULL :
2025-07-04 12:31:08.103 CEST [28754] LOG: starting PostgreSQL
19devel on x86_64-linux, compiled by gcc-12.2.0, 64-bit
[..]
2025-07-04 12:31:08.109 CEST [28754] LOG: io worker (PID 28755)
was terminated by signal 11: Segmentation fault
2025-07-04 12:31:08.109 CEST [28754] LOG: terminating any other
active server processes
2025-07-04 12:31:08.114 CEST [28754] LOG: shutting down because
"restart_after_crash" is off
2025-07-04 12:31:08.116 CEST [28754] LOG: database system is shut down
[New LWP 28755]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
Core was generated by `postgres: io worker '.
Program terminated with signal SIGSEGV, Segmentation fault.
#0 __new_sem_wait_fast (definitive_result=1, sem=sem@entry=0x0)
at ./nptl/sem_waitcommon.c:136
136 ./nptl/sem_waitcommon.c: No such file or directory.
(gdb) where
#0 __new_sem_wait_fast (definitive_result=1, sem=sem@entry=0x0)
at ./nptl/sem_waitcommon.c:136
#1 __new_sem_trywait (sem=sem@entry=0x0) at ./nptl/sem_wait.c:81
#2 0x00005561918e0cac in PGSemaphoreReset (sema=0x0) at
../src/backend/port/posix_sema.c:302
#3 0x0000556191970553 in InitAuxiliaryProcess () at
../src/backend/storage/lmgr/proc.c:992
#4 0x00005561918e51a2 in AuxiliaryProcessMainCommon () at
../src/backend/postmaster/auxprocess.c:65
#5 0x0000556191940676 in IoWorkerMain (startup_data=<optimized
out>, startup_data_len=<optimized out>) at
../src/backend/storage/aio/method_worker.c:393
#6 0x00005561918e8163 in postmaster_child_launch
(child_type=child_type@entry=B_IO_WORKER, child_slot=20086,
startup_data=startup_data@entry=0x0,
startup_data_len=startup_data_len@entry=0,
client_sock=client_sock@entry=0x0) at
../src/backend/postmaster/launch_backend.c:290
#7 0x00005561918ea09a in StartChildProcess
(type=type@entry=B_IO_WORKER) at
../src/backend/postmaster/postmaster.c:3973
#8 0x00005561918ea308 in maybe_adjust_io_workers () at
../src/backend/postmaster/postmaster.c:4404
[..]
(gdb) print *MyProc->sem
Cannot access memory at address 0x0
Yeah, good catch. I'll look into that next week.
9. v1-0006: is this just a thought or serious candidate? I can imagine
it can easily blow-up with some backends somehow requesting CPUs only
from one NUMA node, while the second node being idle. Isn't it better
just to leave CPU scheduling, well, to the CPU scheduler? The problem
is that you have tools showing overall CPU usage, even mpstat(1) per
CPU , but no tools for per-NUMA node CPU util%, so it would be hard
for someone to realize that this is happening.
Mostly experimental, for benchmarking etc. I agree we may not want to
mess with the task scheduling too much.
Thanks for the feedback!
regards
--
Tomas Vondra
Hi Tomas,
I haven't yet had time to fully read all the work and proposals around
NUMA and related features, but I hope to catch up over the summer.
However, I think it's important to share some thoughts before it's too
late, as you might find them relevant to the NUMA management code.
6) v1-0006-NUMA-pin-backends-to-NUMA-nodes.patch
This is an experimental patch, that simply pins the new process to the
NUMA node obtained from the freelist.
Driven by GUC "numa_procs_pin" (default: off).
In my work on more careful PostgreSQL resource management, I've come to
the conclusion that we should avoid pushing policy too deeply into the
PostgreSQL core itself. Therefore, I'm quite skeptical about integrating
NUMA-specific management directly into core PostgreSQL in such a way.
We are working on a PROFILE and PROFILE MANAGER specification to provide
PostgreSQL with only the APIs and hooks needed so that extensions can
manage whatever they want externally.
The basic syntax (not meant to be discussed here, and even the names
might change) is roughly as follows, just to illustrate the intent:
CREATE PROFILE MANAGER manager_name [IF NOT EXISTS]
[ HANDLER handler_function | NO HANDLER ]
[ VALIDATOR validator_function | NO VALIDATOR ]
[ OPTIONS ( option 'value' [, ... ] ) ]
CREATE PROFILE profile_name
[IF NOT EXISTS]
USING profile_manager
SET key = value [, key = value]...
[USING profile_manager
SET key = value [, key = value]...]
[...];
CREATE PROFILE MAPPING
[IF NOT EXISTS]
FOR PROFILE profile_name
[MATCH [ ALL | ANY ] (
[ROLE role_name],
[BACKEND TYPE backend_type],
[DATABASE database_name],
[APPLICATION appname]
)];
## PROFILE RESOLUTION ORDER
1. ALTER ROLE IN DATABASE
2. ALTER ROLE
3. ALTER DATABASE
4. First matching PROFILE MAPPING (global or specific)
5. No profile (fallback)
As currently designed, this approach allows quite a lot of flexibility:
* pg_psi is used to ensure the spec is suitable for a cgroup profile
manager (moving PIDs as needed; NUMA and cgroups could work well
together, see e.g. this Linux kernel summary:
https://blogs.oracle.com/linux/post/numa-balancing )
* Someone else could implement support for Windows or BSD specifics.
* Others might use it to integrate PostgreSQL's own resources (e.g.,
"areas" of shared buffers) into policies.
Hope this perspective is helpful.
Best regards,
--
Cédric Villemain +33 6 20 30 22 52
https://www.Data-Bene.io
PostgreSQL Support, Expertise, Training, R&D
On 7/5/25 09:09, Cédric Villemain wrote:
Hi Tomas,
I haven't yet had time to fully read all the work and proposals around
NUMA and related features, but I hope to catch up over the summer.
However, I think it's important to share some thoughts before it's too
late, as you might find them relevant to the NUMA management code.
6) v1-0006-NUMA-pin-backends-to-NUMA-nodes.patch
This is an experimental patch, that simply pins the new process to the
NUMA node obtained from the freelist.
Driven by GUC "numa_procs_pin" (default: off).
In my work on more careful PostgreSQL resource management, I've come to
the conclusion that we should avoid pushing policy too deeply into the
PostgreSQL core itself. Therefore, I'm quite skeptical about integrating
NUMA-specific management directly into core PostgreSQL in such a way.
We are working on a PROFILE and PROFILE MANAGER specification to provide
PostgreSQL with only the APIs and hooks needed so that extensions can
manage whatever they want externally.
The basic syntax (not meant to be discussed here, and even the names
might change) is roughly as follows, just to illustrate the intent:
CREATE PROFILE MANAGER manager_name [IF NOT EXISTS]
[ HANDLER handler_function | NO HANDLER ]
[ VALIDATOR validator_function | NO VALIDATOR ]
[ OPTIONS ( option 'value' [, ... ] ) ]
CREATE PROFILE profile_name
[IF NOT EXISTS]
USING profile_manager
SET key = value [, key = value]...
[USING profile_manager
SET key = value [, key = value]...]
[...];
CREATE PROFILE MAPPING
[IF NOT EXISTS]
FOR PROFILE profile_name
[MATCH [ ALL | ANY ] (
[ROLE role_name],
[BACKEND TYPE backend_type],
[DATABASE database_name],
[APPLICATION appname]
)];
## PROFILE RESOLUTION ORDER
1. ALTER ROLE IN DATABASE
2. ALTER ROLE
3. ALTER DATABASE
4. First matching PROFILE MAPPING (global or specific)
5. No profile (fallback)
As currently designed, this approach allows quite a lot of flexibility:
* pg_psi is used to ensure the spec is suitable for a cgroup profile
manager (moving PIDs as needed; NUMA and cgroups could work well
together, see e.g. this Linux kernel summary:
https://blogs.oracle.com/linux/post/numa-balancing )
* Someone else could implement support for Windows or BSD specifics.
* Others might use it to integrate PostgreSQL's own resources (e.g.,
"areas" of shared buffers) into policies.
Hope this perspective is helpful.
Can you explain how you want to manage this by an extension defined at
the SQL level, when most of this stuff has to be done when setting up
shared memory, which is waaaay before we have any access to catalogs?
regards
--
Tomas Vondra
Hi Tomas, some more thoughts after the weekend:
On Fri, Jul 4, 2025 at 8:12 PM Tomas Vondra <tomas@vondra.me> wrote:
On 7/4/25 13:05, Jakub Wartak wrote:
On Tue, Jul 1, 2025 at 9:07 PM Tomas Vondra <tomas@vondra.me> wrote:
Hi!
1) v1-0001-NUMA-interleaving-buffers.patch
[..]
It's a bit more complicated, because the patch distributes both the
blocks and descriptors, in the same way. So a buffer and it's descriptor
always end on the same NUMA node. This is one of the reasons why we need
to map larger chunks, because NUMA works on page granularity, and the
descriptors are tiny - many fit on a memory page.
Oh, now I get it! OK, let's stick to this one.
I don't think the splitting would actually make some things simpler, or
maybe more flexible - in particular, it'd allow us to enable huge pages
only for some regions (like shared buffers), and keep the small pages
e.g. for PGPROC. So that'd be good.
You have made assumption that this is good, but small pages (4KB) are
not hugetlb, and are *swappable* (Transparent HP are swappable too,
manually allocated ones as with mmap(MMAP_HUGETLB) are not)[1]. The
most frequent problem I see these days are OOMs, and it makes me
believe that making certain critical parts of shared memory being
swappable just to make pagesize granular is possibly throwing the baby
out with the bathwater. I'm thinking about bad situations like: some
wrong settings of vm.swapiness that people keep (or distros keep?) and
general inability of PG to restrain from allocating more memory in
some cases.
I haven't observed such issues myself, or maybe I didn't realize it's
happening. Maybe it happens, but it'd be good to see some data showing
that, or a reproducer of some sort. But let's say it's real.
I don't think we should use huge pages merely to ensure something is not
swapped out. The "not swappable" is more of a limitation of huge pages,
not an advantage. You can't just choose to make them swappable.
Wouldn't it be better to keep using 4KB pages, but lock the memory using
mlock/mlockall?
In my book, not being swappable is a win (it's hard for me to imagine
when it could be beneficial to swap out parts of s_b).
I was trying to think about it and also came up with these:
Anyway, mlock() probably sounds like it, but e.g. Rocky 8.10 by default
has max locked memory (ulimit -l) as low as 64kB due to systemd's
DefaultLimitMEMLOCK, while Debian/Ubuntu have it at higher values.
I wasn't expecting that - those are bizarre low values. I think we would
need something like (10000*900)/1024/1024 or more, but with each
PGPROC on a separate page that would be way more?
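For illustration, the locking itself would be roughly this (a sketch, not
from any of the patches; how to report the failure is an open question):

#include <sys/mman.h>
#include <errno.h>
#include <stddef.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/*
 * Pin an already-mapped shared memory region into RAM with mlock(2), so
 * its 4kB pages cannot be swapped out. This is subject to RLIMIT_MEMLOCK
 * (ulimit -l), which is exactly the DefaultLimitMEMLOCK concern above.
 */
static bool
lock_region(void *addr, size_t len)
{
    if (mlock(addr, len) != 0)
    {
        /* typically ENOMEM/EPERM when exceeding RLIMIT_MEMLOCK */
        fprintf(stderr, "mlock failed: %s\n", strerror(errno));
        return false;
    }
    return true;
}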
Another thing with 4kB pages: there's this big assumption now that
once we arrive in InitProcess() we won't ever change NUMA node,
so we stick to the PGPROC from where we started (based on getcpu(2)).
Let's assume the CPU scheduler reassigned us to a different node, but we
have this 4kB page ready for PGPROC in theory, and this means we would
need to rely on the NUMA autobalancing doing its job to migrate that
4kB page from node to node (to get local accesses instead of
remote ones). The questions in my head are now like this:
- we initially asked for those PGPROC pages to be localized
on a certain node (they have a policy), so they won't autobalance? We
would need to call getcpu() again somewhere, notice the difference and
unlocalize (clear the NUMA/mbind() policy) the PGPROC page?
- mlock() as above says stick to the physical RAM page (?), so it won't move?
- after what time would the kernel's autobalancing migrate that page after
switching the active CPU<->node? I mean, do we execute enough reads on
this page?
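The "unlocalize" part might look roughly like this (a sketch, not from any
posted patch; mbind(2) with MPOL_DEFAULT clears the per-range policy so
NUMA balancing is free to migrate the page again):

#include <numaif.h>
#include <stddef.h>

/*
 * If getcpu()/sched_getcpu() tells us we now run on a different node than
 * the one our PGPROC page was bound to, drop the explicit placement policy
 * on that page.
 */
static void
maybe_unbind_pgproc_page(void *pgproc_page, size_t page_size,
                         int assigned_node, int current_node)
{
    if (current_node != assigned_node)
        (void) mbind(pgproc_page, page_size, MPOL_DEFAULT, NULL, 0, 0);
}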
BTW: to make this pragmatic, what's the most trivial one-liner
way to exercise/stress PGPROC?
The other thing I haven't thought about very much is determining on
which CPUs/nodes the instance is allowed to run. I assume we'd start by
simply inherit/determine that at the start through libnuma, not through
some custom PG configuration (which the patch [2] proposed to do).
0. I think that we could do better, some counter arguments to
no-configuration-at-all:
a. as Robert & Bertrand already put it there after review: let's say I
want just to run on NUMA #2 node, so here I would need to override
systemd's script ExecStart= to include that numactl (not elegant?). I
could also use `CPUAffinity=1,3,5,7..` but that's all, and it is even
less friendly. Also it probably requires root to edit/reload systemd,
while having GUC for this like in my proposal makes it more smooth (I
think?)
b. wouldn't it be better if that stayed as drop-in rather than always
on? What if there's a problem, how do you disable those internal
optimizations if they do harm in some cases? (or let's say I want to
play with MPOL_INTERLEAVE_WEIGHTED?). So at least boolean
numa_buffers_interleave would be nice?
c. What if I want my standby (walreceiver+startup/recovery) to run
with NUMA affinity to get better performance (I'm not going to hack
around systemd script every time, but I could imagine changing
numa=X,Y,Z after restart/before promotion)
d. Now if I would be forced for some reason to do that numactl(1)
voodoo, and use the those above mentioned overrides and PG wouldn't be
having GUC (let's say I would use `numactl
--weighted-interleave=0,1`), then:
I'm not against doing something like this, but I don't plan to do that
in V1. I don't have a clear idea what configurability is actually
needed, so it's likely I'd do the interface wrong.
2) v1-0002-NUMA-localalloc.patch
This simply sets "localalloc" when initializing a backend, so that all
memory allocated later is local, not interleaved. Initially this was
necessary because the patch set the allocation policy to interleaving
before initializing shared memory, and we didn't want to interleave the
private memory. But that's no longer the case - the explicit mapping to
nodes does not have this issue. I'm keeping the patch for convenience,
it allows experimenting with numactl etc.
... is not accurate anymore and we would require to have that in
(still with GUC)?
Thoughts? I can add that mine part into Your's patches if you want.
I'm sorry, I don't understand what's the question :-(
That patch reference above was a chain of thought from step "d".
What I had in mind was that you cannot remove the patch
`v1-0002-NUMA-localalloc.patch` from the scope if you force people to
use numactl by not having enough configurability on the PG side. That
is: if someone has to use systemd + numactl
--interleave/--weighted-interleave, then they will also need a
way to use numa_localalloc=on (to override the new/user's policy
default, otherwise local memory allocations are also going to be
interleaved, and we are back to square one). Which brings me to the
point of why we should include the configuration properly from the
start instead of this toggle (it's not that hard, apparently).
Way too quick review and some very fast benchmark probes, I've
concentrated only on v1-0001 and v1-0005 (efficiency of buffermgmt
would be too new topic for me), but let's start:
1. normal pgbench -S (still with just s_b@4GB), done many tries,
consistent benefit for the patch with like +8..10% boost on generic
run:
[.. removed numbers]
But this actually brings an interesting question. What exactly should we
expect / demand from these patches? In my mind it'd primarily about
predictability and stability of results.
For example, the results should not depend on how was the database
warmed up - was it done by a single backend or many backends? Was it
restarted, or what? I could probably warmup the system very carefully to
ensure it's balanced. The patches mean I don't need to be that careful.
Well, pretty much the same here. I was after minimizing "stddev" (to
have better predictability of results, especially across restarts) and
increasing available bandwidth [which is pretty much related]. Without
our NUMA work, PG can just put that s_b on any random node, or spill it
randomly from one node to another (depending on the size of the
allocation request).
So should I close https://commitfest.postgresql.org/patch/5703/
and you'll open a new one or should I just edit the #5703 and alter it
and add this thread too?
Good question. It's probably best to close the original entry as
"withdrawn" and I'll add a new entry. Sounds OK?
Sure thing, marked it as `Returned with feedback`, this approach seems
to be much more advanced.
3. Patch is not calling interleave on PQ shmem, do we want to add that
in as some next item like v1-0007? Question is whether OS interleaving
makes sense there ? I believe it does there, please see my thread
(NUMA_pq_cpu_pinning_results.txt), the issue is that PQ workers are
being spawned by postmaster and may end up on different NUMA nodes
randomly, so actually OS-interleaving that memory reduces jitter there
(AKA bandwidth-over-latency). My thinking is that one cannot expect
static/forced CPU-to-just-one-NUMA-node assignment for backend and
it's PQ workers, because it is impossible have always available CPU
power there in that NUMA node, so it might be useful to interleave
that shared mem there too (as separate patch item?)
Excellent question. I haven't thought about this at all. I agree it
probably makes sense to interleave this memory, in some way. I don't
know what's the perfect scheme, though.
wild idea: Would it make sense to pin the workers to the same NUMA node
as the leader? And allocate all memory only from that node?
I'm trying to convey exactly the opposite message, or at least that it
might depend on configuration. Please see
/messages/by-id/CAKZiRmxYMPbQ4WiyJWh=Vuw_Ny+hLGH9_9FaacKRJvzZ-smm+w@mail.gmail.com
(btw it should read there that I don't intend to spend a lot of time on
PQ), but anyway: I think we should NOT pin the PQ workers to the same
NODE, as you do not know if there's CPU left there (same story as with
v1-0006 here).
I'm just proposing quick OS-based interleaving of PQ shm if using all
nodes, literally:
@@ -334,6 +336,13 @@ dsm_impl_posix(dsm_op op, dsm_handle handle, Size request_size,
     }
     *mapped_address = address;
     *mapped_size = request_size;
+
+    /* We interleave memory only at creation time. */
+    if (op == DSM_OP_CREATE && numa->setting > NUMA_OFF) {
+        elog(DEBUG1, "interleaving shm mem @ %p size=%zu",
+             *mapped_address, *mapped_size);
+        pg_numa_interleave_memptr(*mapped_address, *mapped_size, numa->nodes);
+    }
+
Because then, if memory is interleaved, you probably have less variance
in memory access. But also from that previous thread:
"So if anything:
- latency-wise: it would be best to place leader+all PQ workers close
to s_b, provided s_b fits NUMA shared/huge page memory there and you
won't need more CPU than there's on that NUMA node... (assuming e.g.
hosting 4 DBs on 4-sockets each on it's own, it would be best to pin
everything including shm, but PQ workers too)
- capacity/TPS-wise or s_b > NUMA: just interleave to maximize
bandwidth and get uniform CPU performance out of this"
So the wild idea was: maybe PQ shm interleaving should depend on the NUMA
configuration (if interleaving to all nodes, then interleave normally,
but if the configuration is set to just 1 NUMA node, it automatically binds
there -- there was '@' support for that in my patch).
4 In BufferManagerShmemInit() you call numa_num_configured_nodes()
(also in v1-0005). My worry is should we may put some
known-limitations docs (?) from start and mention that
if the VM is greatly resized and NUMA numa nodes appear, they might
not be used until restart?
Yes, this is one thing I need some feedback on. The patches mostly
assume there are no disabled nodes, that the set of allowed nodes does
not change, etc. I think for V1 that's a reasonable limitation.
Sure!
But let's say we want to relax this a bit. How do we learn about the
change, after a node/CPU gets disabled? For some parts it's not that
difficult (e.g. we can "remap" buffers/descriptors) in the background.
But for other parts that's not practical. E.g. we can't rework how the
PGPROC gets split.
But while discussing this with Andres yesterday, he had an interesting
suggestion - to always use e.g. 8 or 16 partitions, then partition this
by NUMA node. So we'd have 16 partitions, and with 4 nodes the 0-3 would
go to node 0, 4-7 would go to node 1, etc. The advantage is that if a
node gets disabled, we can rebuild just this small "mapping" and not the
16 partitions. And the partitioning may be helpful even without NUMA.
Still have to figure out the details, but seems it might help.
Right, no idea how the shared_memory remapping patch will work
(how/when the s_b change will be executed), but we could somehow mark
that the number of NUMA zones should be rechecked during SIGHUP (?) and
then just do a simple comparison check whether old_numa_num_configured_nodes
== new_numa_num_configured_nodes is true.
Anyway, I think it's way too advanced for now, don't you think? (like
CPU ballooning [s_b itself] is rare, and NUMA ballooning seems to be
super-wild-rare)
As for the rest, I forgot to include this too: getcpu() - this really
needs a portable pg_getcpu() wrapper.
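Something like this, perhaps (a sketch only - the HAVE_SCHED_GETCPU
configure check is hypothetical, it doesn't exist today; a real version
would probably also want the NUMA node, e.g. via getcpu(2) or
numa_node_of_cpu()):

#ifdef HAVE_SCHED_GETCPU
#include <sched.h>
#endif

/* Return the CPU the calling process is currently running on, or -1. */
static int
pg_getcpu(void)
{
#ifdef HAVE_SCHED_GETCPU
    return sched_getcpu();      /* may still return -1 on failure */
#else
    return -1;                  /* not supported on this platform */
#endif
}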
-J.
* Others might use it to integrate PostgreSQL's own resources (e.g.,
"areas" of shared buffers) into policies.Hope this perspective is helpful.
Can you explain how you want to manage this by an extension defined at
the SQL level, when most of this stuff has to be done when setting up
shared memory, which is waaaay before we have any access to catalogs?
I should have said module instead; I didn't follow carefully, but at some
point there was discussion about shared buffers being resized "on-line".
Anyway, it was just to give a few examples, maybe this one is to be
considered later (I'm focused on cgroup/psi, and precisely reassigning
PIDs as needed).
--
Cédric Villemain +33 6 20 30 22 52
https://www.Data-Bene.io
PostgreSQL Support, Expertise, Training, R&D
On 7/7/25 16:51, Cédric Villemain wrote:
* Others might use it to integrate PostgreSQL's own resources (e.g.,
"areas" of shared buffers) into policies.Hope this perspective is helpful.
Can you explain how you want to manage this by an extension defined at
the SQL level, when most of this stuff has to be done when setting up
shared memory, which is waaaay before we have any access to catalogs?
I should have said module instead, I didn't follow carefully but at some
point there were discussion about shared buffers resized "on-line".
Anyway, it was just to give some few examples, maybe this one is to be
considered later (I'm focused on cgroup/psi, and precisely reassigning
PIDs as needed).
I don't know. I have a hard time imagining what exactly the
policies / profiles would do to respond to changes in the system
utilization. And why should that interfere with this patch ...
The main thing the patch series aims to implement is partitioning different
pieces of shared memory (buffers, freelists, ...) to work better for
NUMA. I don't think there are that many ways to do this, and I doubt it
makes sense to make this easily customizable from external modules of
any kind. I can imagine providing some API allowing the instance to be
isolated on selected NUMA nodes, but that's about it.
Yes, there's some relation to the online resizing of shared buffers, in
which case we need to "refresh" some of the information. But AFAICS it's
not very extensive (on top of what already needs to happen after the
resize), and it'd happen within the boundaries of the partitioning
scheme. There's not that much flexibility.
The last bit (pinning backends to a NUMA node) is experimental, and
mostly intended for easier evaluation of the earlier parts (e.g. to
limit the noise when processes get moved to a CPU from a different NUMA
node, and so on).
regards
--
Tomas Vondra
Hi,
On 2025-07-05 07:09:00 +0000, Cédric Villemain wrote:
In my work on more careful PostgreSQL resource management, I've come to the
conclusion that we should avoid pushing policy too deeply into the
PostgreSQL core itself. Therefore, I'm quite skeptical about integrating
NUMA-specific management directly into core PostgreSQL in such a way.
I think it's actually the opposite - whenever we pushed stuff like this
outside of core it has hurt postgres substantially. Not having replication in
core was a huge mistake. Not having HA management in core is probably the
biggest current adoption hurdle for postgres.
To deal better with NUMA we need to improve memory placement and various
algorithms, in an interrelated way - that's pretty much impossible to do
outside of core.
Greetings,
Andres Freund
On 7/7/25 16:51, Cédric Villemain wrote:
* Others might use it to integrate PostgreSQL's own resources (e.g.,
"areas" of shared buffers) into policies.Hope this perspective is helpful.
Can you explain how you want to manage this by an extension defined at
the SQL level, when most of this stuff has to be done when setting up
shared memory, which is waaaay before we have any access to catalogs?
I should have said module instead, I didn't follow carefully but at some
point there were discussion about shared buffers resized "on-line".
Anyway, it was just to give some few examples, maybe this one is to be
considered later (I'm focused on cgroup/psi, and precisely reassigning
PIDs as needed).
I don't know. I have a hard time imagining what exactly would the
policies / profiles do exactly to respond to changes in the system
utilization. And why should that interfere with this patch ...
The main thing patch series aims to implement is partitioning different
pieces of shared memory (buffers, freelists, ...) to better work for
NUMA. I don't think there's that many ways to do this, and I doubt it
makes sense to make this easily customizable from external modules of
any kind. I can imagine providing some API allowing to isolate the
instance on selected NUMA nodes, but that's about it.
Yes, there's some relation to the online resizing of shared buffers, in
which case we need to "refresh" some of the information. But AFAICS it's
not very extensive (on top of what already needs to happen after the
resize), and it'd happen within the boundaries of the partitioning
scheme. There's not that much flexibility.
The last bit (pinning backends to a NUMA node) is experimental, and
mostly intended for easier evaluation of the earlier parts (e.g. to
limit the noise when processes get moved to a CPU from a different NUMA
node, and so on).
Perhaps the backend pinning could be done by replacing your patch on proc.c
with a call to an external profile manager doing exactly the same thing?
Similar to:
pmroutine = GetPmRoutineForInitProcess();
if (pmroutine != NULL &&
pmroutine->init_process != NULL)
pmroutine->init_process(MyProc);
...
pmroutine = GetPmRoutineForInitAuxilliary();
if (pmroutine != NULL &&
pmroutine->init_auxilliary != NULL)
pmroutine->init_auxilliary(MyProc);
Added in a few select places, this should cover most if not all of the
requirements around process placement (process_shared_preload_libraries()
is called earlier in process creation, I believe).
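For context, the hook structure behind those calls might look roughly like
this (purely a sketch; the names follow the snippet above and are not from
any posted patch):

typedef struct PmRoutine
{
    void        (*init_process) (PGPROC *proc);     /* regular backends */
    void        (*init_auxilliary) (PGPROC *proc);  /* auxiliary processes */
} PmRoutine;

extern PmRoutine *GetPmRoutineForInitProcess(void);
extern PmRoutine *GetPmRoutineForInitAuxilliary(void);

A "profile manager" module would fill this in at load time, and core would
only consult it at the few process-start points mentioned above.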
--
Cédric Villemain +33 6 20 30 22 52
https://www.Data-Bene.io
PostgreSQL Support, Expertise, Training, R&D
Hi Andres,
Hi,
On 2025-07-05 07:09:00 +0000, Cédric Villemain wrote:
In my work on more careful PostgreSQL resource management, I've come to the
conclusion that we should avoid pushing policy too deeply into the
PostgreSQL core itself. Therefore, I'm quite skeptical about integrating
NUMA-specific management directly into core PostgreSQL in such a way.
I think it's actually the opposite - whenever we pushed stuff like this
outside of core it has hurt postgres substantially. Not having replication in
core was a huge mistake. Not having HA management in core is probably the
biggest current adoption hurdle for postgres.
To deal better with NUMA we need to improve memory placement and various
algorithms, in an interrelated way - that's pretty much impossible to do
outside of core.
Except the backend pinning which is easy to achieve, thus my comment on
the related patch.
I'm not claiming NUMA memory and all should be managed outside of core
(though I didn't read other patches yet).
--
Cédric Villemain +33 6 20 30 22 52
https://www.Data-Bene.io
PostgreSQL Support, Expertise, Training, R&D
On 7/7/25 16:51, Cédric Villemain wrote:
* Others might use it to integrate PostgreSQL's own resources (e.g.,
"areas" of shared buffers) into policies.Hope this perspective is helpful.
Can you explain how you want to manage this by an extension defined at
the SQL level, when most of this stuff has to be done when setting up
shared memory, which is waaaay before we have any access to catalogs?
I should have said module instead, I didn't follow carefully but at some
point there were discussion about shared buffers resized "on-line".
Anyway, it was just to give some few examples, maybe this one is to be
considered later (I'm focused on cgroup/psi, and precisely reassigning
PIDs as needed).
I don't know. I have a hard time imagining what exactly would the
policies / profiles do exactly to respond to changes in the system
utilization. And why should that interfere with this patch ...
The main thing patch series aims to implement is partitioning different
pieces of shared memory (buffers, freelists, ...) to better work for
NUMA. I don't think there's that many ways to do this, and I doubt it
makes sense to make this easily customizable from external modules of
any kind. I can imagine providing some API allowing to isolate the
instance on selected NUMA nodes, but that's about it.
Yes, there's some relation to the online resizing of shared buffers, in
which case we need to "refresh" some of the information. But AFAICS it's
not very extensive (on top of what already needs to happen after the
resize), and it'd happen within the boundaries of the partitioning
scheme. There's not that much flexibility.
The last bit (pinning backends to a NUMA node) is experimental, and
mostly intended for easier evaluation of the earlier parts (e.g. to
limit the noise when processes get moved to a CPU from a different NUMA
node, and so on).
The backend pinning can be done by replacing your patch on proc.c to
call an external profile manager doing exactly the same thing maybe ?
Similar to:
pmroutine = GetPmRoutineForInitProcess();
if (pmroutine != NULL &&
pmroutine->init_process != NULL)
pmroutine->init_process(MyProc);...
pmroutine = GetPmRoutineForInitAuxilliary();
if (pmroutine != NULL &&
pmroutine->init_auxilliary != NULL)
pmroutine->init_auxilliary(MyProc);Added on some rare places should cover most if not all the requirement
around process placement (process_shared_preload_libraries() is called
earlier in the process creation I believe).
After a first read I think this works for patches 002 and 005. For the
last one, InitProcGlobal() may set things up as you do, but then expose
the choice a bit later, basically in the places where you added the if
condition on the GUC numa_procs_interleave.
--
Cédric Villemain +33 6 20 30 22 52
https://www.Data-Bene.io
PostgreSQL Support, Expertise, Training, R&D
Hi,
On 2025-07-04 13:05:05 +0200, Jakub Wartak wrote:
On Tue, Jul 1, 2025 at 9:07 PM Tomas Vondra <tomas@vondra.me> wrote:
I don't think the splitting would actually make some things simpler, or
maybe more flexible - in particular, it'd allow us to enable huge pages
only for some regions (like shared buffers), and keep the small pages
e.g. for PGPROC. So that'd be good.
You have made assumption that this is good, but small pages (4KB) are
not hugetlb, and are *swappable* (Transparent HP are swappable too,
manually allocated ones as with mmap(MMAP_HUGETLB) are not)[1]. The
most frequent problem I see these days are OOMs, and it makes me
believe that making certain critical parts of shared memory being
swappable just to make pagesize granular is possibly throwing the baby
out with the bathwater. I'm thinking about bad situations like: some
wrong settings of vm.swapiness that people keep (or distros keep?) and
general inability of PG to restrain from allocating more memory in
some cases.
The reason it would be advantageous to put something like the procarray onto
smaller pages is that otherwise the entire procarray (unless particularly
large) ends up on a single NUMA node, increasing the latency for backends on
every other numa node and increasing memory traffic on that node.
Greetings,
Andres Freund
On 7/8/25 05:04, Andres Freund wrote:
Hi,
On 2025-07-04 13:05:05 +0200, Jakub Wartak wrote:
On Tue, Jul 1, 2025 at 9:07 PM Tomas Vondra <tomas@vondra.me> wrote:
I don't think the splitting would actually make some things simpler, or
maybe more flexible - in particular, it'd allow us to enable huge pages
only for some regions (like shared buffers), and keep the small pages
e.g. for PGPROC. So that'd be good.
You have made assumption that this is good, but small pages (4KB) are
not hugetlb, and are *swappable* (Transparent HP are swappable too,
manually allocated ones as with mmap(MMAP_HUGETLB) are not)[1]. The
most frequent problem I see these days are OOMs, and it makes me
believe that making certain critical parts of shared memory being
swappable just to make pagesize granular is possibly throwing the baby
out with the bathwater. I'm thinking about bad situations like: some
wrong settings of vm.swapiness that people keep (or distros keep?) and
general inability of PG to restrain from allocating more memory in
some cases.
The reason it would be advantageous to put something like the procarray onto
smaller pages is that otherwise the entire procarray (unless particularly
large) ends up on a single NUMA node, increasing the latency for backends on
every other numa node and increasing memory traffic on that node.
That's why the patch series splits the procarray into multiple pieces,
so that it can be properly distributed on multiple NUMA nodes even with
huge pages. It requires adjusting a couple places accessing the entries,
but it surprised me how limited the impact was.
If we could selectively use 4KB pages for parts of the shared memory,
maybe this wouldn't be necessary. But it's not too annoying.
The thing I'm not sure about is how much this actually helps with the
traffic between nodes. Sure, if we pick a PGPROC from the same node, and
the task does not get moved, it'll be local traffic. But if the task
moves, there'll be traffic. I don't have any estimates for how often this
happens, e.g. for older tasks.
regards
--
Tomas Vondra
On 7/8/25 03:55, Cédric Villemain wrote:
Hi Andres,
Hi,
On 2025-07-05 07:09:00 +0000, Cédric Villemain wrote:
In my work on more careful PostgreSQL resource management, I've come
to the
conclusion that we should avoid pushing policy too deeply into the
PostgreSQL core itself. Therefore, I'm quite skeptical about integrating
NUMA-specific management directly into core PostgreSQL in such a way.
I think it's actually the opposite - whenever we pushed stuff like this
outside of core it has hurt postgres substantially. Not having replication in
core was a huge mistake. Not having HA management in core is probably the
biggest current adoption hurdle for postgres.
To deal better with NUMA we need to improve memory placement and various
algorithms, in an interrelated way - that's pretty much impossible to do
outside of core.
Except the backend pinning which is easy to achieve, thus my comment on
the related patch.
I'm not claiming NUMA memory and all should be managed outside of core
(though I didn't read other patches yet).
But an "optimal backend placement" seems to very much depend on where we
placed the various pieces of shared memory. Which the external module
will have trouble following, I suspect.
I still don't have any idea what exactly the external module would do,
or how it would decide where to place the backend. Can you describe some
use case with an example?
Assuming we want to actually pin tasks from within Postgres, what I
think might work is allowing modules to "advise" on where to place the
task. But the decision would still be done by core.
regards
--
Tomas Vondra
Hi,
On 2025-07-08 14:27:12 +0200, Tomas Vondra wrote:
On 7/8/25 05:04, Andres Freund wrote:
On 2025-07-04 13:05:05 +0200, Jakub Wartak wrote:
The reason it would be advantageous to put something like the procarray onto
smaller pages is that otherwise the entire procarray (unless particularly
large) ends up on a single NUMA node, increasing the latency for backends on
every other numa node and increasing memory traffic on that node.
That's why the patch series splits the procarray into multiple pieces,
so that it can be properly distributed on multiple NUMA nodes even with
huge pages. It requires adjusting a couple places accessing the entries,
but it surprised me how limited the impact was.
Sure, you can do that, but it does mean that iterations over the procarray now
have an added level of indirection...
The thing I'm not sure about is how much this actually helps with the
traffic between node. Sure, if we pick a PGPROC from the same node, and
the task does not get moved, it'll be local traffic. But if the task
moves, there'll be traffic. I don't have any estimates how often this
happens, e.g. for older tasks.
I think the most important bit is to not put everything onto one numa node,
otherwise the chance of increased latency for *everyone* due to the increased
memory contention is more likely to hurt.
Greetings,
Andres Freund
On 7/8/25 03:55, Cédric Villemain wrote:
Hi Andres,
Hi,
On 2025-07-05 07:09:00 +0000, Cédric Villemain wrote:
In my work on more careful PostgreSQL resource management, I've come
to the
conclusion that we should avoid pushing policy too deeply into the
PostgreSQL core itself. Therefore, I'm quite skeptical about integrating
NUMA-specific management directly into core PostgreSQL in such a way.
I think it's actually the opposite - whenever we pushed stuff like this
outside of core it has hurt postgres substantially. Not having replication in
core was a huge mistake. Not having HA management in core is probably the
biggest current adoption hurdle for postgres.
To deal better with NUMA we need to improve memory placement and various
algorithms, in an interrelated way - that's pretty much impossible to do
outside of core.
Except the backend pinning which is easy to achieve, thus my comment on
the related patch.
I'm not claiming NUMA memory and all should be managed outside of core
(though I didn't read other patches yet).
But an "optimal backend placement" seems to very much depend on where we
placed the various pieces of shared memory. Which the external module
will have trouble following, I suspect.
I still don't have any idea what exactly would the external module do,
how would it decide where to place the backend. Can you describe some
use case with an example?
Assuming we want to actually pin tasks from within Postgres, what I
think might work is allowing modules to "advise" on where to place the
task. But the decision would still be done by core.
Possibly exactly what you're doing in proc.c when managing the allocation of
processes, but not hardcoded in PostgreSQL (patches 02, 05 and 06 are good
candidates); I didn't get the impression that they require information not
available to any process executing code from a module.
Parts of your code where you assign/define policy could be in one or
more relevant routines of a "numa profile manager", like in an
initProcessRoutine(), and registered in pmroutine struct:
pmroutine = GetPmRoutineForInitProcess();
if (pmroutine != NULL &&
pmroutine->init_process != NULL)
pmroutine->init_process(MyProc);
This way it's easier to manage alternative policies, and also to
adjust when the hardware and Linux kernel change.
--
Cédric Villemain +33 6 20 30 22 52
https://www.Data-Bene.io
PostgreSQL Support, Expertise, Training, R&D
On 7/8/25 18:06, Cédric Villemain wrote:
On 7/8/25 03:55, Cédric Villemain wrote:
Hi Andres,
Hi,
On 2025-07-05 07:09:00 +0000, Cédric Villemain wrote:
In my work on more careful PostgreSQL resource management, I've come
to the
conclusion that we should avoid pushing policy too deeply into the
PostgreSQL core itself. Therefore, I'm quite skeptical about
integrating
NUMA-specific management directly into core PostgreSQL in such a way.
I think it's actually the opposite - whenever we pushed stuff like this
outside of core it has hurt postgres substantially. Not having
replication in core was a huge mistake. Not having HA management in core is
probably the biggest current adoption hurdle for postgres.
To deal better with NUMA we need to improve memory placement and
various algorithms, in an interrelated way - that's pretty much impossible
to do outside of core.
Except the backend pinning which is easy to achieve, thus my comment on
the related patch.
I'm not claiming NUMA memory and all should be managed outside of core
(though I didn't read other patches yet).
But an "optimal backend placement" seems to very much depend on where we
placed the various pieces of shared memory. Which the external module
will have trouble following, I suspect.
I still don't have any idea what exactly would the external module do,
how would it decide where to place the backend. Can you describe some
use case with an example?
Assuming we want to actually pin tasks from within Postgres, what I
think might work is allowing modules to "advise" on where to place the
task. But the decision would still be done by core.
Possibly exactly what you're doing in proc.c when managing allocation of
process, but not hardcoded in postgresql (patches 02, 05 and 06 are good
candidates), I didn't get that they require information not available to
any process executing code from a module.
Well, it needs to understand how some other stuff (especially PGPROC
entries) is distributed between nodes. I'm not sure how much of this
internal information we want to expose outside core ...
Parts of your code where you assign/define policy could be in one or
more relevant routines of a "numa profile manager", like in an
initProcessRoutine(), and registered in pmroutine struct:
pmroutine = GetPmRoutineForInitProcess();
if (pmroutine != NULL &&
pmroutine->init_process != NULL)
pmroutine->init_process(MyProc);
This way it's easier to manage alternative policies, and also to be able
to adjust when hardware and linux kernel changes.
I'm not against making this extensible, in some way. But I still
struggle to imagine a reasonable alternative policy, where the external
module gets the same information and ends up with a different decision.
So what would the alternate policy look like? What use case would the
module be supporting?
regards
--
Tomas Vondra
On 7/8/25 18:06, Cédric Villemain wrote:
On 7/8/25 03:55, Cédric Villemain wrote:
Hi Andres,
Hi,
On 2025-07-05 07:09:00 +0000, Cédric Villemain wrote:
In my work on more careful PostgreSQL resource management, I've come
to the
conclusion that we should avoid pushing policy too deeply into the
PostgreSQL core itself. Therefore, I'm quite skeptical about
integrating
NUMA-specific management directly into core PostgreSQL in such a way.
I think it's actually the opposite - whenever we pushed stuff like this
outside of core it has hurt postgres substantially. Not having
replication in core was a huge mistake. Not having HA management in core is
probably the biggest current adoption hurdle for postgres.
To deal better with NUMA we need to improve memory placement and
various algorithms, in an interrelated way - that's pretty much impossible
to do outside of core.
Except the backend pinning which is easy to achieve, thus my comment on
the related patch.
I'm not claiming NUMA memory and all should be managed outside of core
(though I didn't read other patches yet).

But an "optimal backend placement" seems to very much depend on where we
placed the various pieces of shared memory. Which the external module
will have trouble following, I suspect.

I still don't have any idea what exactly would the external module do,
how would it decide where to place the backend. Can you describe some
use case with an example?

Assuming we want to actually pin tasks from within Postgres, what I
think might work is allowing modules to "advise" on where to place the
task. But the decision would still be done by core.

Possibly exactly what you're doing in proc.c when managing allocation of
process, but not hardcoded in postgresql (patches 02, 05 and 06 are good
candidates), I didn't get that they require information not available to
any process executing code from a module.

Well, it needs to understand how some other stuff (especially PGPROC
entries) is distributed between nodes. I'm not sure how much of this
internal information we want to expose outside core ...

Parts of your code where you assign/define policy could be in one or
more relevant routines of a "numa profile manager", like in an
initProcessRoutine(), and registered in pmroutine struct:

    pmroutine = GetPmRoutineForInitProcess();
    if (pmroutine != NULL &&
        pmroutine->init_process != NULL)
        pmroutine->init_process(MyProc);

This way it's easier to manage alternative policies, and also to be able
to adjust when hardware and linux kernel changes.

I'm not against making this extensible, in some way. But I still
struggle to imagine a reasonable alternative policy, where the external
module gets the same information and ends up with a different decision.

So what would the alternate policy look like? What use case would the
module be supporting?
That's the whole point: there are very distinct usages of PostgreSQL in
the field. And maybe not all of them will require the policy defined by
PostgreSQL core.
May I ask the reverse: what prevents external modules from taking those
decisions? There are already a lot of areas where external code can take
over PostgreSQL processing, like Neon is doing.
There is some very early processing for memory setup that I can see as
a current blocker, and here I'd refer to a more compliant NUMA API, as
proposed by Jakub, so it's possible to arrange things based on workload,
hardware configuration or other matters. Reworking to get distinct
segments and so on, as you do, is great, and a combination of both
approaches is probably of great interest. There is also the weighted
interleave being discussed, and probably much more to come in this area
in Linux.
I think some points about possible distinct policies have been raised
already. I am precisely claiming that it is hard to come up with one
good policy with limited setup options, thus the requirement to keep
this flexible enough (hooks, an API, 100 GUCs?).
There is an EPYC story here also: given that the NUMA setup can vary
depending on BIOS settings, the associated NUMA policy must probably take
that into account (L3 can be either a real cache or 4 extra "local" NUMA
nodes - with highly distinct access costs from a RAM module).
Does that change how PostgreSQL will place memory and processes? Is it
important or of interest?
--
Cédric Villemain +33 6 20 30 22 52
https://www.Data-Bene.io
PostgreSQL Support, Expertise, Training, R&D
Hi,
On Wed, Jul 09, 2025 at 06:40:00AM +0000, Cédric Villemain wrote:
On 7/8/25 18:06, Cédric Villemain wrote:
I'm not against making this extensible, in some way. But I still
struggle to imagine a reasonable alternative policy, where the external
module gets the same information and ends up with a different decision.

So what would the alternate policy look like? What use case would the
module be supporting?

That's the whole point: there are very distinct usages of PostgreSQL in the
field. And maybe not all of them will require the policy defined by
PostgreSQL core.

May I ask the reverse: what prevents external modules from taking those
decisions ? There are already a lot of area where external code can take
over PostgreSQL processing, like Neon is doing.

There are some very early processing for memory setup that I can see as a
current blocker, and here I'd refer a more compliant NUMA api as proposed by
Jakub so it's possible to arrange based on workload, hardware configuration
or other matters. Reworking to get distinct segment and all as you do is
great, and combo of both approach probably of great interest.
I think that Tomas's approach helps to have more "predictable" performance
expectations, I mean more consistent over time, fewer "surprises".
While your approach (and Jakub's) could help to get performance gains
depending on a "known" context (so less generic).
So, probably having both could make sense but I think that they serve different
purposes.
Regards,
--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
On Tue, Jul 8, 2025 at 2:56 PM Andres Freund <andres@anarazel.de> wrote:
Hi,
On 2025-07-08 14:27:12 +0200, Tomas Vondra wrote:
On 7/8/25 05:04, Andres Freund wrote:
On 2025-07-04 13:05:05 +0200, Jakub Wartak wrote:
The reason it would be advantageous to put something like the procarray onto
smaller pages is that otherwise the entire procarray (unless particularly
large) ends up on a single NUMA node, increasing the latency for backends on
every other numa node and increasing memory traffic on that node.
Sure thing, I fully understand the motivation and underlying reason
(without claiming that I understand the exact memory access patterns
that involve procarray/PGPROC/etc and hotspots involved from PG side).
Any single-liner pgbench help for how to really easily stress the
PGPROC or procarray?
That's why the patch series splits the procarray into multiple pieces,
so that it can be properly distributed on multiple NUMA nodes even with
huge pages. It requires adjusting a couple places accessing the entries,
but it surprised me how limited the impact was.
Yes, and we are discussing if it is worth getting into smaller pages
for such use cases (e.g. 4kB ones without hugetlb, with 2MB huge pages,
or - even more wasteful - 1GB hugetlb if we don't request 2MB for some
small structs; btw, we have the ability to select MAP_HUGE_2MB vs
MAP_HUGE_1GB). I'm thinking about two problems:
- 4kB are swappable and mlock() potentially (?) disarms NUMA autobalancing
- using libnuma often leads to MPOL_BIND which disarms NUMA
autobalancing, BUT apparently there are set_mempolicy(2)/mbind(2) and
since 5.12+ kernel they can take additional flag
MPOL_F_NUMA_BALANCING(!), so this looks like it has potential to move
memory anyway (if way too many tasks are relocated, so would be
memory?). It is available only in recent libnuma as
numa_set_membind_balancing(3), but sadly there's no way via libnuma to
do mbind(MPOL_F_NUMA_BALANCING) for a specific addr only? I mean it
would have be something like MPOL_F_NUMA_BALANCING | MPOL_PREFERRED?
(select one node from many for each node while still allowing
balancing?), but in [1][2] (2024) it is stated that "It's not
legitimate (yet) to use MPOL_PREFERRED + MPOL_F_NUMA_BALANCING.", but
maybe stuff has been improved since then.
Something like:
PGPROC/procarray 2MB page for node#1 - mbind(addr1,
MPOL_F_NUMA_BALANCING | MPOL_PREFERRED, [0,1]);
PGPROC/procarray 2MB page for node#2 - mbind(addr2,
MPOL_F_NUMA_BALANCING | MPOL_PREFERRED, [1,0]);
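For illustration only, a rough sketch of what such a per-chunk call could
look like (assuming a <numaif.h> new enough to define
MPOL_F_NUMA_BALANCING; bind_chunk_with_balancing() is a made-up helper,
and since the MPOL_PREFERRED combination is reportedly not accepted yet,
this sticks to MPOL_BIND):

    #include <numaif.h>         /* mbind(), MPOL_* */
    #include <errno.h>
    #include <stdio.h>
    #include <string.h>

    /*
     * Hypothetical helper: bind one chunk (e.g. a 2MB page holding PGPROC
     * entries) to a single node while asking the kernel to keep NUMA
     * balancing enabled. Needs a 5.12+ kernel; may fail with EINVAL on
     * kernels that don't accept this mode/flag combination.
     */
    static int
    bind_chunk_with_balancing(void *addr, size_t len, int node)
    {
        unsigned long nodemask = 1UL << node;

        if (mbind(addr, len, MPOL_BIND | MPOL_F_NUMA_BALANCING,
                  &nodemask, sizeof(nodemask) * 8, 0) != 0)
        {
            fprintf(stderr, "mbind failed: %s\n", strerror(errno));
            return -1;
        }
        return 0;
    }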
Sure, you can do that, but it does mean that iterations over the procarray now
have an added level of indirection...
So the most efficient would be the old-way (no indirections) vs
NUMA-way? Can this be done without #ifdefs at all?
The thing I'm not sure about is how much this actually helps with the
traffic between node. Sure, if we pick a PGPROC from the same node, and
the task does not get moved, it'll be local traffic. But if the task
moves, there'll be traffic.
With MPOL_F_NUMA_BALANCING, that should "auto-tune" in the worst case?
I don't have any estimates how often this happens, e.g. for older tasks.
We could measure, kernel 6.16+ has per PID numa_task_migrated in
/proc/{PID}/sched , but I assume we would have to throw backends >>
VCPUs at it, to simulate reality and do some "waves" between different
activity periods of certain pools (I can imagine worst case scenario:
a) pgbench "a" open $VCPU connections, all idle, with scripto to sleep
for a while
b) pgbench "b" open some $VCPU new connections to some other DB, all
active from start (tpcbb or readonly)
c) manually ping CPUs using taskset for each PID all from "b" to
specific NUMA node #2 -- just to simulate unfortunate app working on
every 2nd conn
d) pgbench "a" starts working and hits CPU imbalance -- e.g. NUMA node
#1 is idle, #2 is full, CPU scheduler starts putting "a" backends on
CPUs from #1 , and we should notice PIDs being migrated)
I think the most important bit is to not put everything onto one numa node,
otherwise the chance of increased latency for *everyone* due to the increased
memory contention is more likely to hurt.
-J.
p.s. I hope i did write in an understandable way, because I had many
interruptions, so if anything is unclear please let me know.
[1]: https://lkml.org/lkml/2024/7/3/352
[2]: https://lkml.rescloud.iu.edu/2402.2/03227.html
Hi,
On 2025-07-02 14:36:31 +0200, Tomas Vondra wrote:
On 7/2/25 13:37, Ashutosh Bapat wrote:
On Wed, Jul 2, 2025 at 12:37 AM Tomas Vondra <tomas@vondra.me> wrote:
3) v1-0003-freelist-Don-t-track-tail-of-a-freelist.patch
Minor optimization. Andres noticed we're tracking the tail of buffer
freelist, without using it. So the patch removes that.

The patches for resizing buffers use the lastFreeBuffer to add new
buffers to the end of free list when expanding it. But we could as
well add it at the beginning of the free list.
Yea, I don't see any point in adding buffers to the tail instead of to the
front. We probably want more recently used buffers at the front, since they
(and the associated BufferDesc) are more likely to be in a CPU cache.
This patch seems almost independent of the rest of the patches. Do you
need it in the rest of the patches? I understand that those patches
don't need to worry about maintaining lastFreeBuffer after this patch.
Is there any other effect?

If we are going to do this, let's do it earlier so that buffer
resizing patches can be adjusted.

My patches don't particularly rely on this bit, it would work even with
lastFreeBuffer. I believe Andres simply noticed the current code does
not use lastFreeBuffer, it just maintains it, so he removed that as an
optimization.
Optimization / simplification. When building multiple freelists it was
harder to maintain the tail pointer, and since it was never used...
+1 to just applying that part.
I don't know how significant is the improvement, but if it's measurable we
could just do that independently of our patches.
I doubt it's really an improvement in any realistic scenario, but it's also
not a regression in any way, since it's never used...
FWIW, I've started to wonder if we shouldn't just get rid of the freelist
entirely. While clocksweep is perhaps minutely slower in a single thread than
the freelist, clock sweep scales *considerably* better [1]. As it's rather
rare to be bottlenecked on clock sweep speed for a single thread (rather than
IO or memory copy overhead), I think it's worth favoring clock sweep.
Also needing to switch between getting buffers from the freelist and the sweep
makes the code more expensive. I think just having the buffer in the sweep,
with a refcount / usagecount of zero would suffice.
That seems particularly advantageous if we invest energy in making the clock
sweep deal well with NUMA systems, because we don't need to have both a NUMA
aware freelist and a NUMA aware clock sweep.
Greetings,
Andres Freund
[1]: A single pg_prewarm of a large relation shows a difference between
using the freelist and not that's around the noise level, whereas 40
parallel pg_prewarms of separate relations is over 5x faster when
disabling the freelist.
For the test:
- I modified pg_buffercache_evict_* to put buffers onto the freelist
- Ensured all of shared buffers is allocated by querying
pg_shmem_allocations_numa, as otherwise the workload is dominated by the
kernel zeroing out buffers
- used shared_buffers bigger than the data
- data for single threaded is 9.7GB, data for the parallel case is 40
relations of 610MB each.
- in the single threaded case I pinned postgres to a single core, to make sure
core-to-core variation doesn't play a role
- single threaded case
c=1 && psql -Xq -c "select pg_buffercache_evict_all()" -c 'SELECT numa_node, sum(size), count(*) FROM pg_shmem_allocations_numa WHERE size != 0 GROUP BY numa_node;' && pgbench -n -P1 -c$c -j$c -f <(echo "SELECT pg_prewarm('copytest_large');") -t1
concurrent case:
c=40 && psql -Xq -c "select pg_buffercache_evict_all()" -c 'SELECT numa_node, sum(size), count(*) FROM pg_shmem_allocations_numa WHERE size != 0 GROUP BY numa_node;' && pgbench -n -P1 -c$c -j$c -f <(echo "SELECT pg_prewarm('copytest_:client_id');") -t1
On Jul 9 2025, at 12:35 pm, Andres Freund <andres@anarazel.de> wrote:
FWIW, I've started to wonder if we shouldn't just get rid of the freelist
entirely. While clocksweep is perhaps minutely slower in a single
thread than
the freelist, clock sweep scales *considerably* better [1]. As it's rather
rare to be bottlenecked on clock sweep speed for a single thread
(rather then
IO or memory copy overhead), I think it's worth favoring clock sweep.
Hey Andres, thanks for spending time on this. I've worked before on
freelist implementations (last one in LMDB) and I think you're onto
something. I think it's an innovative idea and that the speed
difference will either be lost in the noise or potentially entirely
mitigated by avoiding duplicate work.
Also needing to switch between getting buffers from the freelist and
the sweep
makes the code more expensive. I think just having the buffer in the sweep,
with a refcount / usagecount of zero would suffice.
If you're not already coding this, I'll jump in. :)
That seems particularly advantageous if we invest energy in making the clock
sweep deal well with NUMA systems, because we don't need have both a NUMA
aware freelist and a NUMA aware clock sweep.
100% agree here, very clever approach adapting clock sweep to a NUMA world.
best.
-greg
Greetings,
Andres Freund
Hi,
On 2025-07-09 12:04:00 +0200, Jakub Wartak wrote:
On Tue, Jul 8, 2025 at 2:56 PM Andres Freund <andres@anarazel.de> wrote:
On 2025-07-08 14:27:12 +0200, Tomas Vondra wrote:
On 7/8/25 05:04, Andres Freund wrote:
On 2025-07-04 13:05:05 +0200, Jakub Wartak wrote:
The reason it would be advantageous to put something like the procarray onto
smaller pages is that otherwise the entire procarray (unless particularly
large) ends up on a single NUMA node, increasing the latency for backends on
every other numa node and increasing memory traffic on that node.

Sure thing, I fully understand the motivation and underlying reason
(without claiming that I understand the exact memory access patterns
that involve procarray/PGPROC/etc and hotspots involved from PG side).
Any single-liner pgbench help for how to really easily stress the
PGPROC or procarray?
Unfortunately it's probably going to be slightly more complicated workloads
that show the effect - the very simplest cases don't go iterate through the
procarray itself anymore.
That's why the patch series splits the procarray into multiple pieces,
so that it can be properly distributed on multiple NUMA nodes even with
huge pages. It requires adjusting a couple places accessing the entries,
but it surprised me how limited the impact was.

Yes, and we are discussing if it is worth getting into smaller pages
for such usecases (e.g. 4kB ones without hugetlb with 2MB hugepages or
what more even more waste 1GB hugetlb if we dont request 2MB for some
small structs: btw, we have ability to select MAP_HUGE_2MB vs
MAP_HUGE_1GB). I'm thinking about two problems:
- 4kB are swappable and mlock() potentially (?) disarms NUMA autobalancing
I'm not really bought into this being a problem. If your system has enough
pressure to swap out the PGPROC array, you're so hosed that this won't make a
difference.
- using libnuma often leads to MPOL_BIND which disarms NUMA
autobalancing, BUT apparently there are set_mempolicy(2)/mbind(2) and
since 5.12+ kernel they can take additional flag
MPOL_F_NUMA_BALANCING(!), so this looks like it has potential to move
memory anyway (if way too many tasks are relocated, so would be
memory?). It is available only in recent libnuma as
numa_set_membind_balancing(3), but sadly there's no way via libnuma to
do mbind(MPOL_F_NUMA_BALANCING) for a specific addr only? I mean it
would have be something like MPOL_F_NUMA_BALANCING | MPOL_PREFERRED?
(select one node from many for each node while still allowing
balancing?), but in [1][2] (2024) it is stated that "It's not
legitimate (yet) to use MPOL_PREFERRED + MPOL_F_NUMA_BALANCING.", but
maybe stuff has been improved since then.

Something like:
PGPROC/procarray 2MB page for node#1 - mbind(addr1,
MPOL_F_NUMA_BALANCING | MPOL_PREFERRED, [0,1]);
PGPROC/procarray 2MB page for node#2 - mbind(addr2,
MPOL_F_NUMA_BALANCING | MPOL_PREFERRED, [1,0]);
I'm rather doubtful that it's a good idea to combine numa awareness with numa
balancing. Numa balancing adds latency and makes it much more expensive for
userspace to act in a numa aware way, since it needs to regularly update its
knowledge about where memory resides.
Sure, you can do that, but it does mean that iterations over the procarray now
have an added level of indirection...

So the most efficient would be the old-way (no indirections) vs
NUMA-way? Can this be done without #ifdefs at all?
If we used 4k pages for the procarray we would just have ~4 procs on one page,
if that range were marked as interleaved, it'd probably suffice.
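As a rough illustration of that alternative (a sketch only, assuming
libnuma is available; pg_interleave_procarray() and the 4kB-page mapping
are hypothetical):

    #include <numa.h>        /* numa_available(), numa_interleave_memory() */

    /*
     * Sketch: ask the kernel to interleave the (4kB-page backed) PGPROC
     * range across all allowed NUMA nodes, instead of binding it anywhere.
     */
    static void
    pg_interleave_procarray(void *start, size_t len)
    {
        if (numa_available() == -1)
            return;             /* no NUMA support, nothing to do */

        /* spread the pages of [start, start + len) over all nodes */
        numa_interleave_memory(start, len, numa_all_nodes_ptr);
    }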
The thing I'm not sure about is how much this actually helps with the
traffic between node. Sure, if we pick a PGPROC from the same node, and
the task does not get moved, it'll be local traffic. But if the task
moves, there'll be traffic.

With MPOL_F_NUMA_BALANCING, that should "auto-tune" in the worst case?
I doubt that NUMA balancing is going to help a whole lot here, there are too
many procs on one page for that to be helpful. One thing that might be worth
doing is to *increase* the size of PGPROC by moving other pieces of data that
are keyed by ProcNumber into PGPROC.
I think the main thing to avoid is the case where all of PGPROC, buffer
mapping table, ... resides on one NUMA node (e.g. because it's the one
postmaster was scheduled on), as the increased memory traffic will lead to
queries on that node being slower than the other node.
Greetings,
Andres Freund
Hi,
On 2025-07-09 12:55:51 -0400, Greg Burd wrote:
On Jul 9 2025, at 12:35 pm, Andres Freund <andres@anarazel.de> wrote:
FWIW, I've started to wonder if we shouldn't just get rid of the freelist
entirely. While clocksweep is perhaps minutely slower in a single
thread than
the freelist, clock sweep scales *considerably* better [1]. As it's rather
rare to be bottlenecked on clock sweep speed for a single thread
(rather then
IO or memory copy overhead), I think it's worth favoring clock sweep.

Hey Andres, thanks for spending time on this. I've worked before on
freelist implementations (last one in LMDB) and I think you're onto
something. I think it's an innovative idea and that the speed
difference will either be lost in the noise or potentially entirely
mitigated by avoiding duplicate work.
Agreed. FWIW, just using clock sweep actually makes things like DROP TABLE
perform better because it doesn't need to maintain the freelist anymore...
Also needing to switch between getting buffers from the freelist and
the sweep
makes the code more expensive. I think just having the buffer in the sweep,
with a refcount / usagecount of zero would suffice.

If you're not already coding this, I'll jump in. :)
My experimental patch is literally a four character addition ;), namely adding
"0 &&" to the relevant code in StrategyGetBuffer().
Obviously a real patch would need to do some more work than that. Feel free
to take on that project, I am not planning on tackling that in near term.
There's other things around this that could use some attention. It's not hard
to see clock sweep be a bottleneck in concurrent workloads - partially due to
the shared maintenance of the clock hand. A NUMAed clock sweep would address
that. However, we also maintain StrategyControl->numBufferAllocs, which is a
significant contention point and would not necessarily be removed by a
NUMAification of the clock sweep.
Greetings,
Andres Freund
Hi,
On 2025-07-08 16:06:00 +0000, Cédric Villemain wrote:
Assuming we want to actually pin tasks from within Postgres, what I
think might work is allowing modules to "advise" on where to place the
task. But the decision would still be done by core.

Possibly exactly what you're doing in proc.c when managing allocation of
process, but not hardcoded in postgresql (patches 02, 05 and 06 are good
candidates), I didn't get that they require information not available to any
process executing code from a module.
Parts of your code where you assign/define policy could be in one or more
relevant routines of a "numa profile manager", like in an
initProcessRoutine(), and registered in pmroutine struct:

    pmroutine = GetPmRoutineForInitProcess();
    if (pmroutine != NULL &&
        pmroutine->init_process != NULL)
        pmroutine->init_process(MyProc);

This way it's easier to manage alternative policies, and also to be able to
adjust when hardware and linux kernel changes.
I am doubtful this makes sense - as you can see patch 05 needs to change a
fair bit of core code to make this work, there's no way we can delegate much
of that to an extension.
But even if it's doable, I think it's *very* premature to focus on such
extensibility at this point - we need to get the basics into a mergeable
state, if you then want to argue for adding extensibility, we can do that at
that stage. Trying to design this for extensibility from the get-go, where
that extensibility is very unlikely to be used widely, seems rather likely to
just tank this entire project without getting us anything in return.
Greetings,
Andres Freund
Hi,
Thanks for working on this! I think it's an area we have long neglected...
On 2025-07-01 21:07:00 +0200, Tomas Vondra wrote:
Each patch has a numa_ GUC, intended to enable/disable that part. This
is meant to make development easier, not as a final interface. I'm not
sure how exactly that should look. It's possible some combinations of
GUCs won't work, etc.
Wonder if some of it might be worth putting into a multi-valued GUC (like
debug_io_direct).
1) v1-0001-NUMA-interleaving-buffers.patch
This is the main thing when people think about NUMA - making sure the
shared buffers are allocated evenly on all the nodes, not just on a
single node (which can happen easily with warmup). The regular memory
interleaving would address this, but it also has some disadvantages.

Firstly, it's oblivious to the contents of the shared memory segment,
and we may not want to interleave everything. It's also oblivious to
alignment of the items (a buffer can easily end up "split" on multiple
NUMA nodes), or relationship between different parts (e.g. there's a
BufferBlock and a related BufferDescriptor, and those might again end up
on different nodes).
Two more disadvantages:
With OS interleaving postgres doesn't (not easily at least) know about what
maps to what, which means postgres can't do stuff like numa aware buffer
replacement.
With OS interleaving the interleaving is "too fine grained", with pages being
mapped at each page boundary, making it less likely for things like one
strategy ringbuffer to reside on a single numa node.
I wonder if we should *increase* the size of shared_buffers whenever huge
pages are in use and there's padding space due to the huge page
boundaries. Pretty pointless to waste that memory if we can instead use it for
the buffer pool. Not that big a deal with 2MB huge pages, but with 1GB huge
pages...
4) v1-0004-NUMA-partition-buffer-freelist.patch
Right now we have a single freelist, and in busy instances that can be
quite contended. What's worse, the freelist may thrash between different
CPUs, NUMA nodes, etc. So the idea is to have multiple freelists on
subsets of buffers. The patch implements multiple strategies for how the
list can be split (configured using "numa_partition_freelist" GUC), for
experimenting:

* node - One list per NUMA node. This is the most natural option,
because we now know which buffer is on which node, so we can ensure a
list for a node only has buffers from that node.

* cpu - One list per CPU. Pretty simple, each CPU gets its own list.

* pid - Similar to "cpu", but the processes are mapped to lists based on
PID, not CPU ID.

* none - nothing, single freelist
Ultimately, I think we'll want to go with "node", simply because it
aligns with the buffer interleaving. But there are improvements needed.
I think we might eventually want something more granular than just "node" -
the freelist (and the clock sweep) can become a contention point even within
one NUMA node. I'm imagining something like an array of freelists/clocksweep
states, where the current numa node selects a subset of the array and the cpu
is used to choose the entry within that list.
But we can do that later, that should be a fairly simple extension of what
you're doing.
The other missing part is clocksweep - there's still just a single
instance of clocksweep, feeding buffers to all the freelists. But that's
clearly a problem, because the clocksweep returns buffers from all NUMA
nodes. The clocksweep really needs to be partitioned the same way as a
freelists, and each partition will operate on a subset of buffers (from
the right NUMA node).

I do have a separate experimental patch doing something like that, I
need to make it part of this branch.
I'm really curious about that patch, as I wrote elsewhere in this thread, I
think we should just get rid of the freelist alltogether. Even if we don't do
so, in a steady state system the clock sweep is commonly much more important
than the freelist...
5) v1-0005-NUMA-interleave-PGPROC-entries.patch
Another area that seems like it might benefit from NUMA is PGPROC, so I
gave it a try. It turned out somewhat challenging. Similarly to buffers
we have two pieces that need to be located in a coordinated way - PGPROC
entries and fast-path arrays. But we can't use the same approach as for
buffers/descriptors, because

(a) Neither of those pieces aligns with memory page size (PGPROC is
~900B, fast-path arrays are variable length).
(b) We could pad PGPROC entries e.g. to 1KB, but that'd still require
rather high max_connections before we use multiple huge pages.
We should probably pad them regardless? Right now sizeof(PGPROC) happens to be
a multiple of 64 (i.e. the most common cache line size), but that hasn't always
been the case, and isn't the case on systems with 128 byte cache lines like
common ARMv8 systems. And having one cacheline hold one backend's fast-path
state and another backend's xmin doesn't sound like a recipe for good
performance.
Seems like we should also do some reordering of the contents within PGPROC. We
e.g. have very frequently changing data (->waitStatus, ->lwWaiting) in
the same cacheline as almost immutable data (->pid, ->pgxactoff,
->databaseId).
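If padding were added, one way to do it would mirror the existing
BufferDescPadded pattern; a sketch only, where PGPROCPadded and
PGPROC_PADDED_SIZE are made-up names and the 1KB target is an assumption:

    /* Hypothetical padding wrapper, modeled on BufferDescPadded. */
    #define PGPROC_PADDED_SIZE 1024     /* assumed target, a multiple of the
                                         * largest expected cache line size */

    typedef union PGPROCPadded
    {
        PGPROC      proc;
        char        pad[PGPROC_PADDED_SIZE];
    } PGPROCPadded;

    StaticAssertDecl(sizeof(PGPROC) <= PGPROC_PADDED_SIZE,
                     "PGPROC exceeds padded size");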
So what I did instead is splitting the whole PGPROC array into one array
per NUMA node, and one array for auxiliary processes and 2PC xacts. So
with 4 NUMA nodes there are 5 separate arrays, for example. Each array
is a multiple of memory pages, so we may waste some of the memory. But
that's simply how NUMA works - page granularity.
Theoretically we could use the "padding" memory at the end of each NUMA node's
PGPROC array for the 2PC entries, since for those we presumably don't care
about locality. Not sure it's worth the complexity though.
For a while I thought I had a better solution: Given that we're going to waste
all the "padding" memory, why not just oversize the PGPROC array so that it
spans the required number of NUMA nodes?
The problem is that that would lead to ProcNumbers to get much larger, and we
do have other arrays that are keyed by ProcNumber. Which probably makes this
not so great an idea.
This however makes one particular thing harder - in a couple places we
accessed PGPROC entries through PROC_HDR->allProcs, which was pretty
much just one large array. And GetNumberFromPGProc() relied on array
arithmetics to determine procnumber. With the array partitioned, this
can't work the same way.

But there's a simple solution - if we turn allProcs into an array of
*pointers* to PGPROC arrays, there's no issue. All the places need a
pointer anyway. And then we need an explicit procnumber field in PGPROC,
instead of calculating it.

There's a chance this has a negative impact on code that accessed PGPROC
very often, but so far I haven't seen such cases. But if you can come up
with such examples, I'd like to see those.
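To illustrate the indirection being discussed - a sketch under the
assumption that allProcs becomes an array of PGPROC pointers; the macro
names follow the existing ones, but the bodies shown here are hypothetical:

    /* Before (roughly): allProcs is one contiguous PGPROC array. */
    /* #define GetPGProcByNumber(n)   (&ProcGlobal->allProcs[(n)])       */
    /* #define GetNumberFromPGProc(p) ((p) - &ProcGlobal->allProcs[0])   */

    /* After (sketch): allProcs holds pointers, procnumber is stored. */
    #define GetPGProcByNumber(n)   (ProcGlobal->allProcs[(n)])
    #define GetNumberFromPGProc(p) ((p)->procnumber)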
I'd not be surprised if there were overhead, adding a level of indirection to
things like ProcArrayGroupClearXid(), GetVirtualXIDsDelayingChkpt(),
SignalVirtualTransaction() probably won't be free.
BUT: For at least some of these a better answer might be to add additional
"dense" arrays like we have for xids etc, so they don't need to trawl through
PGPROCs.
There's another detail - when obtaining a PGPROC entry in InitProcess(),
we try to get an entry from the same NUMA node. And only if that doesn't
work, we grab the first one from the list (there's still just one PGPROC
freelist, I haven't split that - maybe we should?).
I guess it might be worth partitioning the freelist, iterating through a few
thousand links just to discover that there's no free proc on the current numa
node, while holding a spinlock, doesn't sound great. Even if it's likely
rarely a huge issue compared to other costs.
The other thing I haven't thought about very much is determining on
which CPUs/nodes the instance is allowed to run. I assume we'd start by
simply inherit/determine that at the start through libnuma, not through
some custom PG configuration (which the patch [2] proposed to do).
That seems like the right thing to me.
One thing that this patchset afaict doesn't address so far is that there is a
fair bit of other important shared memory that this patch doesn't set up
intelligently e.g. the buffer mapping table itself (but there are loads of
other cases). Because we touch a lot of that memory during startup, most of it
will be allocated on whatever NUMA node postmaster was scheduled on. I suspect
that the best we can do for parts of shared memory where we don't have
explicit NUMA awareness is to default to an interleave policy.
From 9712e50d6d15c18ea2c5fcf457972486b0d4ef53 Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tomas@vondra.me>
Date: Tue, 6 May 2025 21:12:21 +0200
Subject: [PATCH v1 1/6] NUMA: interleaving buffers

Ensure shared buffers are allocated from all NUMA nodes, in a balanced
way, instead of just using the node where Postgres initially starts, or
where the kernel decides to migrate the page, etc. With pre-warming
performed by a single backend, this can easily result in severely
unbalanced memory distribution (with most from a single NUMA node).

The kernel would eventually move some of the memory to other nodes
(thanks to zone_reclaim), but that tends to take a long time. So this
patch improves predictability, reduces the time needed for warmup
during benchmarking, etc. It's less dependent on what the CPU
scheduler does, etc.
FWIW, I don't think zone_reclaim_mode will commonly do that? Even if enabled,
which I don't think it is anymore by default. At least huge pages can't be
reclaimed by the kernel, but even when not using huge pages, I think the only
scenario where that would happen is if shared_buffers were swapped out.
Numa balancing might eventually "fix" such an imbalance though.
diff --git a/src/backend/storage/buffer/buf_init.c b/src/backend/storage/buffer/buf_init.c
index ed1dc488a42..2ad34624c49 100644
--- a/src/backend/storage/buffer/buf_init.c
+++ b/src/backend/storage/buffer/buf_init.c
@@ -14,9 +14,17 @@
  */
 #include "postgres.h"

+#ifdef USE_LIBNUMA
+#include <numa.h>
+#include <numaif.h>
+#endif
+
I wonder how much of this we should try to put into port/pg_numa.c. Having
direct calls to libnuma code all over the backend will make it rather hard to
add numa awareness for hypothetical platforms not using libnuma compatible
interfaces.
+/* number of buffers allocated on the same NUMA node */
+static int64 numa_chunk_buffers = -1;
Given that NBuffers is a 32bit quantity, this probably doesn't need to be
64bit... Anyway, I'm not going to review on that level going forward, the
patch is probably in too early a state for that.
@@ -71,18 +92,80 @@ BufferManagerShmemInit(void)
 				foundDescs,
 				foundIOCV,
 				foundBufCkpt;
+	Size		mem_page_size;
+	Size		buffer_align;
+
+	/*
+	 * XXX A bit weird. Do we need to worry about postmaster? Could this even
+	 * run outside postmaster? I don't think so.
It can run in single user mode - but that shouldn't prevent us from using
pg_get_shmem_pagesize().
+	 * XXX Another issue is we may get different values than when sizing the
+	 * the memory, because at that point we didn't know if we get huge pages,
+	 * so we assumed we will. Shouldn't cause crashes, but we might allocate
+	 * shared memory and then not use some of it (because of the alignment
+	 * that we don't actually need). Not sure about better way, good for now.
+	 */
Ugh, not seeing a great way to deal with that either.
+	 * XXX Maybe with (mem_page_size > PG_IO_ALIGN_SIZE), we don't need to
+	 * align to mem_page_size? Especially for very large huge pages (e.g. 1GB)
+	 * that doesn't seem quite worth it. Maybe we should simply align to
+	 * BLCKSZ, so that buffers don't get split? Still, we might interfere with
+	 * other stuff stored in shared memory that we want to allocate on a
+	 * particular NUMA node (e.g. ProcArray).
+	 *
+	 * XXX Maybe with "too large" huge pages we should just not do this, or
+	 * maybe do this only for sufficiently large areas (e.g. shared buffers,
+	 * but not ProcArray).
I think that's right - there's no point in using 1GB pages for anything other
than shared_buffers, we should allocate shared_buffers separately.
+/*
+ * Determine the size of memory page.
+ *
+ * XXX This is a bit tricky, because the result depends at which point we call
+ * this. Before the allocation we don't know if we succeed in allocating huge
+ * pages - but we have to size everything for the chance that we will. And then
+ * if the huge pages fail (with 'huge_pages=try'), we'll use the regular memory
+ * pages. But at that point we can't adjust the sizing.
+ *
+ * XXX Maybe with huge_pages=try we should do the sizing twice - first with
+ * huge pages, and if that fails, then without them. But not for this patch.
+ * Up to this point there was no such dependency on huge pages.
+ */
Doing it twice sounds somewhat nasty - but perhaps we could just have the
shmem size infrastructure compute two different numbers, one for use with huge
pages and one without?
+static int64
+choose_chunk_buffers(int NBuffers, Size mem_page_size, int num_nodes)
+{
+	int64		num_items;
+	int64		max_items;
+
+	/* make sure the chunks will align nicely */
+	Assert(BLCKSZ % sizeof(BufferDescPadded) == 0);
+	Assert(mem_page_size % sizeof(BufferDescPadded) == 0);
+	Assert(((BLCKSZ % mem_page_size) == 0) || ((mem_page_size % BLCKSZ) == 0));
+
+	/*
+	 * The minimum number of items to fill a memory page with descriptors and
+	 * blocks. The NUMA allocates memory in pages, and we need to do that for
+	 * both buffers and descriptors.
+	 *
+	 * In practice the BLCKSZ doesn't really matter, because it's much larger
+	 * than BufferDescPadded, so the result is determined buffer descriptors.
+	 * But it's clearer this way.
+	 */
+	num_items = Max(mem_page_size / sizeof(BufferDescPadded),
+					mem_page_size / BLCKSZ);
+
+	/*
+	 * We shouldn't use chunks larger than NBuffers/num_nodes, because with
+	 * larger chunks the last NUMA node would end up with much less memory (or
+	 * no memory at all).
+	 */
+	max_items = (NBuffers / num_nodes);
+
+	/*
+	 * Did we already exceed the maximum desirable chunk size? That is, will
+	 * the last node get less than one whole chunk (or no memory at all)?
+	 */
+	if (num_items > max_items)
+		elog(WARNING, "choose_chunk_buffers: chunk items exceeds max (%ld > %ld)",
+			 num_items, max_items);
+
+	/* grow the chunk size until we hit the max limit. */
+	while (2 * num_items <= max_items)
+		num_items *= 2;
Something around this logic leads to a fair bit of imbalance - I started postgres with
huge_page_size=1GB, shared_buffers=4GB on a 2 node system and that results in
postgres[4188255][1]=# SELECT * FROM pg_shmem_allocations_numa WHERE name in ('Buffer Blocks', 'Buffer Descriptors');
┌────────────────────┬───────────┬────────────┐
│ name │ numa_node │ size │
├────────────────────┼───────────┼────────────┤
│ Buffer Blocks │ 0 │ 5368709120 │
│ Buffer Blocks │ 1 │ 1073741824 │
│ Buffer Descriptors │ 0 │ 1073741824 │
│ Buffer Descriptors │ 1 │ 1073741824 │
└────────────────────┴───────────┴────────────┘
(4 rows)
With shared_buffers=8GB postgres failed to start, even though 16 1GB huge
pages are available, as 18GB were requested.
After increasing the limit, the top allocations were as follows:
postgres[4189384][1]=# SELECT * FROM pg_shmem_allocations ORDER BY allocated_size DESC LIMIT 5;
┌──────────────────────┬─────────────┬────────────┬────────────────┐
│ name │ off │ size │ allocated_size │
├──────────────────────┼─────────────┼────────────┼────────────────┤
│ Buffer Blocks │ 1192223104 │ 9663676416 │ 9663676416 │
│ PGPROC structures │ 10970279808 │ 3221733342 │ 3221733376 │
│ Fast-Path Lock Array │ 14192013184 │ 3221396544 │ 3221396608 │
│ Buffer Descriptors │ 51372416 │ 1140850688 │ 1140850688 │
│ (null) │ 17468590976 │ 785020032 │ 785020032 │
└──────────────────────┴─────────────┴────────────┴────────────────┘
With a fair bit of imbalance:
postgres[4189384][1]=# SELECT * FROM pg_shmem_allocations_numa WHERE name in ('Buffer Blocks', 'Buffer Descriptors');
┌────────────────────┬───────────┬────────────┐
│ name │ numa_node │ size │
├────────────────────┼───────────┼────────────┤
│ Buffer Blocks │ 0 │ 8589934592 │
│ Buffer Blocks │ 1 │ 2147483648 │
│ Buffer Descriptors │ 0 │ 0 │
│ Buffer Descriptors │ 1 │ 2147483648 │
└────────────────────┴───────────┴────────────┘
(4 rows)
Note that the buffer descriptors are all on node 1.
+/*
+ * Calculate the NUMA node for a given buffer.
+ */
+int
+BufferGetNode(Buffer buffer)
+{
+	/* not NUMA interleaving */
+	if (numa_chunk_buffers == -1)
+		return -1;
+
+	return (buffer / numa_chunk_buffers) % numa_nodes;
+}
FWIW, this is likely rather expensive - when not a compile time constant,
divisions and modulo can take a fair number of cycles.
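For example, if both the chunk size and the node count were forced to
powers of two (which is not guaranteed for the node count - 3-node systems
exist), the division and modulo could become a shift and a mask. A sketch,
with numa_chunk_shift and numa_node_mask as hypothetical precomputed
values:

    /* precomputed at startup, assuming power-of-two chunk size/node count */
    static int  numa_chunk_shift;   /* log2(numa_chunk_buffers) */
    static int  numa_node_mask;     /* numa_nodes - 1 */

    int
    BufferGetNodeFast(Buffer buffer)
    {
        if (numa_chunk_buffers == -1)
            return -1;

        return (buffer >> numa_chunk_shift) & numa_node_mask;
    }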
+/*
+ * pg_numa_interleave_memory
+ *		move memory to different NUMA nodes in larger chunks
+ *
+ * startptr - start of the region (should be aligned to page size)
+ * endptr - end of the region (doesn't need to be aligned)
+ * mem_page_size - size of the memory page size
+ * chunk_size - size of the chunk to move to a single node (should be multiple
+ * of page size
+ * num_nodes - number of nodes to allocate memory to
+ *
+ * XXX Maybe this should use numa_tonode_memory and numa_police_memory instead?
+ * That might be more efficient than numa_move_pages, as it works on larger
+ * chunks of memory, not individual system pages, I think.
+ *
+ * XXX The "interleave" name is not quite accurate, I guess.
+ */
+static void
+pg_numa_interleave_memory(char *startptr, char *endptr,
+						  Size mem_page_size, Size chunk_size,
+						  int num_nodes)
+{
Seems like this should be in pg_numa.c?
diff --git a/src/bin/pgbench/pgbench.c b/src/bin/pgbench/pgbench.c
index 69b6a877dc9..c07de903f76 100644
--- a/src/bin/pgbench/pgbench.c
+++ b/src/bin/pgbench/pgbench.c
I assume those changes weren't intentionally part of this patch...
From 6505848ac8359c8c76dfbffc7150b6601ab07601 Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tomas@vondra.me>
Date: Thu, 22 May 2025 18:38:41 +0200
Subject: [PATCH v1 4/6] NUMA: partition buffer freelist

Instead of a single buffer freelist, partition into multiple smaller
lists, to reduce lock contention, and to spread the buffers over all
NUMA nodes more evenly.

There are four strategies, specified by GUC numa_partition_freelist
* none - single long freelist, should work just like now
* node - one freelist per NUMA node, with only buffers from that node
* cpu - one freelist per CPU
* pid - freelist determined by PID (same number of freelists as 'cpu')
When allocating a buffer, it's taken from the correct freelist (e.g.
same NUMA node).

Note: This is (probably) more important than partitioning ProcArray.
+/*
+ * Represents one freelist partition.
+ */
+typedef struct BufferStrategyFreelist
+{
+	/* Spinlock: protects the values below */
+	slock_t		freelist_lock;
+
+	/*
+	 * XXX Not sure why this needs to be aligned like this. Need to ask
+	 * Andres.
+	 */
+	int			firstFreeBuffer __attribute__((aligned(64)));	/* Head of list of
+																 * unused buffers */
+
+	/* Number of buffers consumed from this list. */
+	uint64		consumed;
+} BufferStrategyFreelist;
I think this might be a leftover from measuring performance of a *non*
partitioned freelist. I saw unnecessar contention between
BufferStrategyControl->{nextVictimBuffer,buffer_strategy_lock,numBufferAllocs}
and was testing what effect the simplest avoidance scheme has.
I don't think this should be part of this patchset.
 /*
  * The shared freelist control information.
  */

@@ -39,8 +66,6 @@ typedef struct
 	pg_atomic_uint32 nextVictimBuffer;

-	int			firstFreeBuffer;	/* Head of list of unused buffers */
-
 	/*
 	 * Statistics. These counters should be wide enough that they can't
 	 * overflow during a single bgwriter cycle.
@@ -51,13 +76,27 @@ typedef struct
 	/*
 	 * Bgworker process to be notified upon activity or -1 if none. See
 	 * StrategyNotifyBgWriter.
+	 *
+	 * XXX Not sure why this needs to be aligned like this. Need to ask
+	 * Andres. Also, shouldn't the alignment be specified after, like for
+	 * "consumed"?
 	 */
-	int			bgwprocno;
+	int			__attribute__((aligned(64))) bgwprocno;
+
+	BufferStrategyFreelist freelists[FLEXIBLE_ARRAY_MEMBER];
 } BufferStrategyControl;
Here the reason was that it's silly to put almost-readonly data (like
bgwprocno) onto the same cacheline as very frequently modified data like
->numBufferAllocs. That causes unnecessary cache misses in many
StrategyGetBuffer() calls, as another backend's StrategyGetBuffer() will
always have modified ->numBufferAllocs and either ->buffer_strategy_lock or
->nextVictimBuffer.
+static BufferStrategyFreelist *
+ChooseFreeList(void)
+{
+	unsigned	cpu;
+	unsigned	node;
+	int			rc;
+
+	int			freelist_idx;
+
+	/* freelist not partitioned, return the first (and only) freelist */
+	if (numa_partition_freelist == FREELIST_PARTITION_NONE)
+		return &StrategyControl->freelists[0];
+
+	/*
+	 * freelist is partitioned, so determine the CPU/NUMA node, and pick a
+	 * list based on that.
+	 */
+	rc = getcpu(&cpu, &node);
+	if (rc != 0)
+		elog(ERROR, "getcpu failed: %m");
Probably should put this into somewhere abstracted away...
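Something along these lines, perhaps, as a pg_numa.c helper (a sketch;
pg_numa_get_current() is a made-up name, and a real version would need a
fallback for platforms without getcpu()):

    #define _GNU_SOURCE
    #include <sched.h>          /* getcpu(), glibc >= 2.29 */

    /*
     * Hypothetical wrapper: report the CPU and NUMA node the calling
     * process is currently running on, or -1/-1 if that can't be determined.
     */
    static void
    pg_numa_get_current(int *cpu, int *node)
    {
        unsigned int c,
                    n;

        if (getcpu(&c, &n) == 0)
        {
            *cpu = (int) c;
            *node = (int) n;
        }
        else
        {
            *cpu = -1;
            *node = -1;
        }
    }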
+	/*
+	 * Pick the freelist, based on CPU, NUMA node or process PID. This matches
+	 * how we built the freelists above.
+	 *
+	 * XXX Can we rely on some of the values (especially strategy_nnodes) to
+	 * be a power-of-2? Then we could replace the modulo with a mask, which is
+	 * likely more efficient.
+	 */
+	switch (numa_partition_freelist)
+	{
+		case FREELIST_PARTITION_CPU:
+			freelist_idx = cpu % strategy_ncpus;
As mentioned earlier, modulo is rather expensive for something executed so
frequently...
+			break;
+
+		case FREELIST_PARTITION_NODE:
+			freelist_idx = node % strategy_nnodes;
+			break;
Here we shouldn't need modulo, right?
+
+		case FREELIST_PARTITION_PID:
+			freelist_idx = MyProcPid % strategy_ncpus;
+			break;
+
+		default:
+			elog(ERROR, "unknown freelist partitioning value");
+	}
+
+	return &StrategyControl->freelists[freelist_idx];
+}
/* size of lookup hash table ... see comment in StrategyInitialize */
 	size = add_size(size, BufTableShmemSize(NBuffers + NUM_BUFFER_PARTITIONS));

 	/* size of the shared replacement strategy control block */
-	size = add_size(size, MAXALIGN(sizeof(BufferStrategyControl)));
+	size = add_size(size, MAXALIGN(offsetof(BufferStrategyControl, freelists)));
+
+	/*
+	 * Allocate one frelist per CPU. We might use per-node freelists, but the
+	 * assumption is the number of CPUs is less than number of NUMA nodes.
+	 *
+	 * FIXME This assumes the we have more CPUs than NUMA nodes, which seems
+	 * like a safe assumption. But maybe we should calculate how many elements
+	 * we actually need, depending on the GUC? Not a huge amount of memory.
FWIW, I don't think that's a safe assumption anymore. With CXL we can get a)
PCIe attached memory and b) remote memory as a separate NUMA nodes, and that
very well could end up as more NUMA nodes than cores.
Ugh, -ETOOLONG. Gotta schedule some other things...
Greetings,
Andres Freund
On Wed, Jul 9, 2025 at 7:13 PM Andres Freund <andres@anarazel.de> wrote:
Yes, and we are discussing if it is worth getting into smaller pages
for such usecases (e.g. 4kB ones without hugetlb with 2MB hugepages or
what more even more waste 1GB hugetlb if we dont request 2MB for some
small structs: btw, we have ability to select MAP_HUGE_2MB vs
MAP_HUGE_1GB). I'm thinking about two problems:
- 4kB are swappable and mlock() potentially (?) disarms NUMA autobalancing

I'm not really bought into this being a problem. If your system has enough
pressure to swap out the PGPROC array, you're so hosed that this won't make a
difference.
OK, I need to bend here, yet part of me still believes that the
situation where we have hugepages (for 'Buffer Blocks') and yet some
smaller, but far more critical structs are more likely to be swapped
out due to the pressure of some backend-gone-wild random mallocs() is
unhealthy (especially since the OS might prefer swapping based on the
per-node rather than the global picture).
I'm rather doubtful that it's a good idea to combine numa awareness with numa
balancing. Numa balancing adds latency and makes it much more expensive for
userspace to act in a numa aware way, since it needs to regularly update its
knowledge about where memory resides.
Well, the problem is that backends come and go to random CPUs
often (migrated++ on very high backend counts and non-uniform
workloads in terms of backend-CPU usage), but the autobalancing
doesn't need to be on or off for everything. It could be autobalancing
for a certain memory region only, without affecting the app in any way
(well, other than the minor page faults, literally).
If we used 4k pages for the procarray we would just have ~4 procs on one page,
if that range were marked as interleaved, it'd probably suffice.
OK, this sounds like the best and simplest proposal to me, yet the
patch doesn't do OS-based interleaving for those today. Gonna try that
mlock() sooner or later... ;)
-J.
On Wed, Jul 9, 2025 at 9:42 PM Andres Freund <andres@anarazel.de> wrote:
On 2025-07-01 21:07:00 +0200, Tomas Vondra wrote:
Each patch has a numa_ GUC, intended to enable/disable that part. This
is meant to make development easier, not as a final interface. I'm not
sure how exactly that should look. It's possible some combinations of
GUCs won't work, etc.

Wonder if some of it might be worth putting into a multi-valued GUC (like
debug_io_direct).
Long-term or for experimentation? Also please see below as it is related:
[..]
FWIW, I don't think that's a safe assumption anymore. With CXL we can get a)
PCIe attached memory and b) remote memory as a separate NUMA nodes, and that
very well could end up as more NUMA nodes than cores.
In my earlier, apparently way too naive approach, I've tried to
handle this CXL scenario, but I'm afraid this cannot be done without
further configuration; please see review/use cases [1] and [2].
-J.
[1]: /messages/by-id/attachment/178119/v4-0001-Add-capability-to-interleave-shared-memory-across.patch - just see sgml/GUC and we have numa_parse_nodestring(3)
[2]: /messages/by-id/aAKPMrX1Uq6quKJy@ip-10-97-1-34.eu-west-3.compute.internal
On Jul 9, 2025, at 1:23 PM, Andres Freund <andres@anarazel.de> wrote:
Hi,
On 2025-07-09 12:55:51 -0400, Greg Burd wrote:
On Jul 9 2025, at 12:35 pm, Andres Freund <andres@anarazel.de> wrote:
FWIW, I've started to wonder if we shouldn't just get rid of the freelist
entirely. While clocksweep is perhaps minutely slower in a single
thread than
the freelist, clock sweep scales *considerably* better [1]. As it's rather
rare to be bottlenecked on clock sweep speed for a single thread
(rather then
IO or memory copy overhead), I think it's worth favoring clock sweep.

Hey Andres, thanks for spending time on this. I've worked before on
freelist implementations (last one in LMDB) and I think you're onto
something. I think it's an innovative idea and that the speed
difference will either be lost in the noise or potentially entirely
mitigated by avoiding duplicate work.

Agreed. FWIW, just using clock sweep actually makes things like DROP TABLE
perform better because it doesn't need to maintain the freelist anymore...

Also needing to switch between getting buffers from the freelist and
the sweep
makes the code more expensive. I think just having the buffer in the sweep,
with a refcount / usagecount of zero would suffice.

If you're not already coding this, I'll jump in. :)
My experimental patch is literally a four character addition ;), namely adding
"0 &&" to the relevant code in StrategyGetBuffer().Obviously a real patch would need to do some more work than that. Feel free
to take on that project, I am not planning on tackling that in near term.
I started on this last night, making good progress. Thanks for the inspiration. I'll create a new thread to track the work and cross-reference when I have something reasonable to show (hopefully later today).
There's other things around this that could use some attention. It's not hard
to see clock sweep be a bottleneck in concurrent workloads - partially due to
the shared maintenance of the clock hand. A NUMAed clock sweep would address
that.
Working on it. Other than NUMA-fying clocksweep there is a function have_free_buffer() that might be a tad tricky to re-implement efficiently and/or make NUMA aware. Or maybe I can remove that too? It is used in autoprewarm.c and possibly other extensions, but nowhere else in core.
However, we also maintain StrategyControl->numBufferAllocs, which is a
significant contention point and would not necessarily be removed by a
NUMAification of the clock sweep.
Yep, I noted this counter and its potential for contention too. Fortunately, it seems like it is only used so that "bgwriter can estimate the rate of buffer consumption" which to me opens the door to a less accurate partitioned counter, perhaps something lock-free (no mutex/CAS) that is bucketed then combined when read.
A quick look at bufmgr.c indicates that recent_allocs (which is StrategyControl->numBufferAllocs) is used to track a "moving average" and other voodoo there I've yet to fully grok. Any thoughts on this approximate count approach?
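A minimal sketch of that idea, assuming one counter per freelist/clock-sweep
partition and PostgreSQL's pg_atomic API; the names used here
(AllocCounterSlot, CountBufferAlloc, SumBufferAllocs) are made up:

    /* One padded slot per partition to avoid false sharing. */
    typedef struct AllocCounterSlot
    {
        pg_atomic_uint64 numBufferAllocs;
        char        pad[PG_CACHE_LINE_SIZE - sizeof(pg_atomic_uint64)];
    } AllocCounterSlot;

    static AllocCounterSlot *alloc_counters;    /* one per partition */

    /* hot path: bump only the local partition's counter */
    static inline void
    CountBufferAlloc(int partition)
    {
        pg_atomic_fetch_add_u64(&alloc_counters[partition].numBufferAllocs, 1);
    }

    /* cold path (bgwriter): sum all partitions for an approximate total */
    static uint64
    SumBufferAllocs(int num_partitions)
    {
        uint64      total = 0;

        for (int i = 0; i < num_partitions; i++)
            total += pg_atomic_read_u64(&alloc_counters[i].numBufferAllocs);
        return total;
    }

The total is only approximate while backends keep allocating, which seems
compatible with the "rate estimate" use the bgwriter makes of it.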
Also, what are your thoughts on updating the algorithm to CLOCK-Pro [1] while I'm there? I guess I'd have to try it out, measure it a lot and see if there are any material benefits. Maybe I'll keep that for a future patch, or at least layer it... back to work!

[1]: https://www.usenix.org/legacy/publications/library/proceedings/usenix05/tech/general/full_papers/jiang/jiang_html/html.html
Greetings,
Andres Freund
best.
-greg
Hi,
On Wed, Jul 09, 2025 at 03:42:26PM -0400, Andres Freund wrote:
Hi,
Thanks for working on this!
Indeed, thanks!
On 2025-07-01 21:07:00 +0200, Tomas Vondra wrote:
1) v1-0001-NUMA-interleaving-buffers.patch
This is the main thing when people think about NUMA - making sure the
shared buffers are allocated evenly on all the nodes, not just on a
single node (which can happen easily with warmup). The regular memory
interleaving would address this, but it also has some disadvantages.

Firstly, it's oblivious to
and we may not want to interleave everything. It's also oblivious to
alignment of the items (a buffer can easily end up "split" on multiple
NUMA nodes), or relationship between different parts (e.g. there's a
BufferBlock and a related BufferDescriptor, and those might again end up
on different nodes).

Two more disadvantages:
With OS interleaving postgres doesn't (not easily at least) know about what
maps to what, which means postgres can't do stuff like numa aware buffer
replacement.

With OS interleaving the interleaving is "too fine grained", with pages being
mapped at each page boundary, making it less likely for things like one
strategy ringbuffer to reside on a single numa node.
There's a secondary benefit of explicitly assigning buffers to nodes,
using this simple scheme - it allows quickly determining the node ID
given a buffer ID. This is helpful later, when building freelist.
I do think this is a big advantage as compare to the OS interleaving.
I wonder if we should *increase* the size of shared_buffers whenever huge
pages are in use and there's padding space due to the huge page
boundaries. Pretty pointless to waste that memory if we can instead use it for
the buffer pool. Not that big a deal with 2MB huge pages, but with 1GB huge
pages...
I think that makes sense, except maybe for operations that need to scan
the whole buffer pool (i.e related to BUF_DROP_FULL_SCAN_THRESHOLD)?
5) v1-0005-NUMA-interleave-PGPROC-entries.patch
Another area that seems like it might benefit from NUMA is PGPROC, so I
gave it a try. It turned out somewhat challenging. Similarly to buffers
we have two pieces that need to be located in a coordinated way - PGPROC
entries and fast-path arrays. But we can't use the same approach as for
buffers/descriptors, because

(a) Neither of those pieces aligns with memory page size (PGPROC is
~900B, fast-path arrays are variable length).

(b) We could pad PGPROC entries e.g. to 1KB, but that'd still require
rather high max_connections before we use multiple huge pages.

Right now sizeof(PGPROC) happens to be multiple of 64 (i.e. the most common
cache line size)
Oh right, it's currently 832 bytes and the patch extends that to 840 bytes.
With a bit of reordering:
diff --git a/src/include/storage/proc.h b/src/include/storage/proc.h
index 5cb1632718e..2ed2f94202a 100644
--- a/src/include/storage/proc.h
+++ b/src/include/storage/proc.h
@@ -194,8 +194,6 @@ struct PGPROC
* vacuum must not remove tuples deleted by
* xid >= xmin ! */
- int procnumber; /* index in ProcGlobal->allProcs */
-
int pid; /* Backend's process ID; 0 if prepared xact */
int pgxactoff; /* offset into various ProcGlobal->arrays with
@@ -243,6 +241,7 @@ struct PGPROC
/* Support for condition variables. */
proclist_node cvWaitLink; /* position in CV wait list */
+ int procnumber; /* index in ProcGlobal->allProcs */
/* Info about lock the process is currently waiting for, if any. */
/* waitLock and waitProcLock are NULL if not currently waiting. */
@@ -268,6 +267,7 @@ struct PGPROC
*/
XLogRecPtr waitLSN; /* waiting for this LSN or higher */
int syncRepState; /* wait state for sync rep */
+ int numa_node;
dlist_node syncRepLinks; /* list link if process is in syncrep queue */
/*
@@ -321,9 +321,6 @@ struct PGPROC
PGPROC *lockGroupLeader; /* lock group leader, if I'm a member */
dlist_head lockGroupMembers; /* list of members, if I'm a leader */
dlist_node lockGroupLink; /* my member link, if I'm a member */
-
- /* NUMA node */
- int numa_node;
};
That could be back to 832 (the order does not make sense logically speaking
though).
Regards,
--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
On 7/9/25 08:40, Cédric Villemain wrote:
On 7/8/25 18:06, Cédric Villemain wrote:
On 7/8/25 03:55, Cédric Villemain wrote:
Hi Andres,
Hi,
On 2025-07-05 07:09:00 +0000, Cédric Villemain wrote:
In my work on more careful PostgreSQL resource management, I've come to the
conclusion that we should avoid pushing policy too deeply into the PostgreSQL
core itself. Therefore, I'm quite skeptical about integrating NUMA-specific
management directly into core PostgreSQL in such a way.

I think it's actually the opposite - whenever we pushed stuff like this
outside of core it has hurt postgres substantially. Not having replication in
core was a huge mistake. Not having HA management in core is probably the
biggest current adoption hurdle for postgres.

To deal better with NUMA we need to improve memory placement and various
algorithms, in an interrelated way - that's pretty much impossible to do
outside of core.

Except the backend pinning which is easy to achieve, thus my comment on
the related patch.
I'm not claiming NUMA memory and all should be managed outside of core
(though I didn't read other patches yet).

But an "optimal backend placement" seems to very much depend on where we
placed the various pieces of shared memory. Which the external module
will have trouble following, I suspect.

I still don't have any idea what exactly the external module would do,
or how it would decide where to place the backend. Can you describe some
use case with an example?

Assuming we want to actually pin tasks from within Postgres, what I
think might work is allowing modules to "advise" on where to place the
task. But the decision would still be done by core.

Possibly exactly what you're doing in proc.c when managing allocation of
processes, but not hardcoded in postgresql (patches 02, 05 and 06 are good
candidates). I didn't get that they require information not available to
any process executing code from a module.

Well, it needs to understand how some other stuff (especially PGPROC
entries) is distributed between nodes. I'm not sure how much of this
internal information we want to expose outside core ...

Parts of your code where you assign/define policy could be in one or
more relevant routines of a "numa profile manager", like in an
initProcessRoutine(), and registered in a pmroutine struct:

    pmroutine = GetPmRoutineForInitProcess();
    if (pmroutine != NULL &&
        pmroutine->init_process != NULL)
        pmroutine->init_process(MyProc);

This way it's easier to manage alternative policies, and also to be able
to adjust when hardware and the Linux kernel change.

I'm not against making this extensible, in some way. But I still
struggle to imagine a reasonable alternative policy, where the external
module gets the same information and ends up with a different decision.
So what would the alternate policy look like? What use case would the
module be supporting?

That's the whole point: there are very distinct usages of PostgreSQL in
the field. And maybe not all of them will require the policy defined by
PostgreSQL core.

May I ask the reverse: what prevents external modules from taking those
decisions? There are already a lot of areas where external code can take
over PostgreSQL processing, like Neon is doing.
The complexity of making everything extensible in an arbitrary way. To
make it extensible in a useful way, we need to have a reasonably clear idea
what aspects need to be extensible, and what's the goal.
There is some very early processing for memory setup that I can see as
a current blocker, and here I'd refer to a more compliant NUMA API as
proposed by Jakub, so it's possible to arrange based on workload,
hardware configuration or other matters. Reworking to get distinct
segments and all as you do is great, and a combo of both approaches is
probably of great interest. There is also the weighted interleave being
discussed, and probably much more to come in this area in Linux.

I think some points were raised already about possible distinct policies; I
am precisely claiming that it is hard to come up with one good policy with
limited setup options, thus the requirement to keep that flexible enough
(hooks, an API, 100 GUCs?).
I'm sorry, I don't want to sound too negative, but "I want arbitrary
extensibility" is not a very useful feedback. I've asked you to give
some examples of policies that'd customize some of the NUMA stuff.
There is an EPYC story here also: given the NUMA setup can vary
depending on BIOS settings, the associated NUMA policy must probably take
that into account (L3 can be either real cache or 4 extra "local" NUMA nodes
- with highly distinct access costs from a RAM module).

Does that change how PostgreSQL will place memory and processes? Is it
important or of interest?
So how exactly would the policy handle this? Right now we're entirely
oblivious to L3, or on-CPU caches in general. We don't even consider the
size of L3 when sizing hash tables in a hashjoin etc.
regards
--
Tomas Vondra
On 7/9/25 19:23, Andres Freund wrote:
Hi,
On 2025-07-09 12:55:51 -0400, Greg Burd wrote:
On Jul 9 2025, at 12:35 pm, Andres Freund <andres@anarazel.de> wrote:
FWIW, I've started to wonder if we shouldn't just get rid of the freelist
entirely. While clocksweep is perhaps minutely slower in a single thread than
the freelist, clock sweep scales *considerably* better [1]. As it's rather
rare to be bottlenecked on clock sweep speed for a single thread (rather than
IO or memory copy overhead), I think it's worth favoring clock sweep.

Hey Andres, thanks for spending time on this. I've worked before on
freelist implementations (last one in LMDB) and I think you're onto
something. I think it's an innovative idea and that the speed
difference will either be lost in the noise or potentially entirely
mitigated by avoiding duplicate work.

Agreed. FWIW, just using clock sweep actually makes things like DROP TABLE
perform better because it doesn't need to maintain the freelist anymore...

Also needing to switch between getting buffers from the freelist and the sweep
makes the code more expensive. I think just having the buffer in the sweep,
with a refcount / usagecount of zero would suffice.

If you're not already coding this, I'll jump in. :)

My experimental patch is literally a four character addition ;), namely adding
"0 &&" to the relevant code in StrategyGetBuffer().

Obviously a real patch would need to do some more work than that. Feel free
to take on that project, I am not planning on tackling that in the near term.

There's other things around this that could use some attention. It's not hard
to see clock sweep be a bottleneck in concurrent workloads - partially due to
the shared maintenance of the clock hand. A NUMAed clock sweep would address
that. However, we also maintain StrategyControl->numBufferAllocs, which is a
significant contention point and would not necessarily be removed by a
NUMAification of the clock sweep.
Wouldn't it make sense to partition the numBufferAllocs too, though? I
don't remember if my hacky experimental NUMA-partitioning patch did that
or if I just thought about doing that, but why wouldn't that be enough?
Places that need the "total" count would have to sum the counters, but
it seemed to me most of the places would be fine with the "local" count
for that partition. If we also make sure to "sync" the clocksweeps so as
to not work on just a single partition, that might be enough ...
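
For illustration, partitioned counters could look roughly like this; the
struct, the names and the fixed partition count are hypothetical, not taken
from any of the patches (in reality the counters would live in shared memory
and be initialized with pg_atomic_init_u32):

#include "postgres.h"
#include "port/atomics.h"

#define NUM_SWEEP_PARTITIONS 4	/* fixed here only to keep the sketch short */

typedef struct SweepPartition
{
	pg_atomic_uint32 numBufferAllocs;	/* allocations in this partition */
} SweepPartition;

static SweepPartition sweep_parts[NUM_SWEEP_PARTITIONS];

/* hot path: each backend only touches its own partition's counter */
static inline void
CountBufferAlloc(int partition)
{
	pg_atomic_fetch_add_u32(&sweep_parts[partition].numBufferAllocs, 1);
}

/* cold path: sum the per-partition counters when a total is needed */
static uint32
TotalBufferAllocs(void)
{
	uint32		total = 0;

	for (int i = 0; i < NUM_SWEEP_PARTITIONS; i++)
		total += pg_atomic_read_u32(&sweep_parts[i].numBufferAllocs);

	return total;
}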
regards
--
Tomas Vondra
Hi,
On 2025-07-10 17:31:45 +0200, Tomas Vondra wrote:
On 7/9/25 19:23, Andres Freund wrote:
There's other things around this that could use some attention. It's not hard
to see clock sweep be a bottleneck in concurrent workloads - partially due to
the shared maintenance of the clock hand. A NUMAed clock sweep would address
that. However, we also maintain StrategyControl->numBufferAllocs, which is a
significant contention point and would not necessarily be removed by a
NUMAification of the clock sweep.

Wouldn't it make sense to partition the numBufferAllocs too, though? I
don't remember if my hacky experimental NUMA-partitioning patch did that
or if I just thought about doing that, but why wouldn't that be enough?

It could be solved together with partitioning, yes - that's what I was trying
to reference with the emphasized bit in "would *not necessarily* be removed by
a NUMAification of the clock sweep".
Greetings,
Andres Freund
Hi,
On 2025-07-10 14:17:21 +0000, Bertrand Drouvot wrote:
On Wed, Jul 09, 2025 at 03:42:26PM -0400, Andres Freund wrote:
I wonder if we should *increase* the size of shared_buffers whenever huge
pages are in use and there's padding space due to the huge page
boundaries. Pretty pointless to waste that memory if we can instead use it for
the buffer pool. Not that big a deal with 2MB huge pages, but with 1GB huge
pages...

I think that makes sense, except maybe for operations that need to scan
the whole buffer pool (i.e. related to BUF_DROP_FULL_SCAN_THRESHOLD)?
I don't think the increases here are big enough for that to matter, unless
perhaps you're using 1GB huge pages. But if you're concerned about dropping
tables very fast (i.e. you're running schema change heavy regression tests),
you're not going to use 1GB huge pages.
5) v1-0005-NUMA-interleave-PGPROC-entries.patch
Another area that seems like it might benefit from NUMA is PGPROC, so I
gave it a try. It turned out somewhat challenging. Similarly to buffers
we have two pieces that need to be located in a coordinated way - PGPROC
entries and fast-path arrays. But we can't use the same approach as for
buffers/descriptors, because

(a) Neither of those pieces aligns with memory page size (PGPROC is
~900B, fast-path arrays are variable length).

(b) We could pad PGPROC entries e.g. to 1KB, but that'd still require
rather high max_connections before we use multiple huge pages.

Right now sizeof(PGPROC) happens to be a multiple of 64 (i.e. the most common
cache line size).

Oh right, it's currently 832 bytes and the patch extends that to 840 bytes.
I don't think the patch itself is the problem - it really is just happenstance
that it's a multiple of the line size right now. And it's not on common Armv8
platforms...
With a bit of reordering:
That could be back to 832 (though the order does not make much sense
logically).
I don't think shrinking the size in a one-off way just to keep the
"accidental" size-is-multiple-of-64 property is promising. It'll just get
broken again. I think we should:
a) pad the size of PGPROC to a cache line (or even to a subsequent power of 2,
to make array access cheaper, right now that involves actual
multiplications rather than shifts or indexed `lea` instructions).
That's probably just a pg_attribute_aligned
b) Reorder PGPROC to separate frequently modified from almost-read-only data,
to increase cache hit ratio.
Greetings,
Andres Freund
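
As a minimal sketch of the padding idea in (a) above - assuming
pg_attribute_aligned() is available on the platform, and with PaddedProc and
PROC_SLOT_SIZE being made-up names rather than anything proposed in the
thread:

#include "postgres.h"

#define PROC_SLOT_SIZE 1024		/* assumed next power of two above sizeof(PGPROC) */

typedef struct PaddedProc
{
#ifdef pg_attribute_aligned
	pg_attribute_aligned(PROC_SLOT_SIZE)
#endif
	int			pid;			/* ... the actual PGPROC fields would go here ... */
	int			pgxactoff;
} PaddedProc;

/*
 * The compiler rounds the struct up to its alignment, so indexing an array
 * of these becomes a shift instead of a multiplication.
 */
StaticAssertDecl(sizeof(PaddedProc) == PROC_SLOT_SIZE,
				 "PaddedProc must fill its slot exactly");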
On Jul 10, 2025, at 8:13 AM, Burd, Greg <greg@burd.me> wrote:
On Jul 9, 2025, at 1:23 PM, Andres Freund <andres@anarazel.de> wrote:
Hi,
On 2025-07-09 12:55:51 -0400, Greg Burd wrote:
On Jul 9 2025, at 12:35 pm, Andres Freund <andres@anarazel.de> wrote:
FWIW, I've started to wonder if we shouldn't just get rid of the freelist
entirely. While clocksweep is perhaps minutely slower in a single thread than
the freelist, clock sweep scales *considerably* better [1]. As it's rather
rare to be bottlenecked on clock sweep speed for a single thread (rather than
IO or memory copy overhead), I think it's worth favoring clock sweep.

Hey Andres, thanks for spending time on this. I've worked before on
freelist implementations (last one in LMDB) and I think you're onto
something. I think it's an innovative idea and that the speed
difference will either be lost in the noise or potentially entirely
mitigated by avoiding duplicate work.

Agreed. FWIW, just using clock sweep actually makes things like DROP TABLE
perform better because it doesn't need to maintain the freelist anymore...

Also needing to switch between getting buffers from the freelist and the sweep
makes the code more expensive. I think just having the buffer in the sweep,
with a refcount / usagecount of zero would suffice.

If you're not already coding this, I'll jump in. :)

My experimental patch is literally a four character addition ;), namely adding
"0 &&" to the relevant code in StrategyGetBuffer().

Obviously a real patch would need to do some more work than that. Feel free
to take on that project, I am not planning on tackling that in the near term.

I started on this last night, making good progress. Thanks for the
inspiration. I'll create a new thread to track the work and cross-reference
when I have something reasonable to show (hopefully later today).

There's other things around this that could use some attention. It's not hard
to see clock sweep be a bottleneck in concurrent workloads - partially due to
the shared maintenance of the clock hand. A NUMAed clock sweep would address
that.

Working on it.
For archival sake, and to tie up loose ends, I'll link from here to a new thread I just started that proposes the removal of the freelist and the buffer_strategy_lock [1].
That patch set doesn't address any NUMA-related tasks directly, but it should remove some pain when working in that direction by removing code that requires partitioning and locking and...
best.
-greg
[1]: /messages/by-id/E2D6FCDC-BE98-4F95-B45E-699C3E17BA10@burd.me
Hi,
Here's a v2 of the patch series, with a couple changes:
* I simplified the various freelist partitioning by keeping only the
"node" partitioning (so the cpu/pid strategies are gone). Those were
meant for experimenting, but they made the code more complicated so I
ditched them.
* I changed the freelist partitioning scheme a little bit, based on the
discussion in this thread. Instead of having a single "partition" per
NUMA node, there's now a minimum number of partitions (set to 4). So
even if your system is not NUMA, you'll have 4 of them. If you have 2
nodes, you'll still have 4, and each node will get 2. With 3 nodes we
get 6 partitions (we need 2 per node, and we want to keep the per-node
number equal to keep things simple). Once the number of nodes exceeds 4,
the heuristic switches to one partition per node. (A sketch of this
heuristic follows below.)
I'm aware there's a discussion about maybe simply removing freelists
entirely. If that happens, this becomes mostly irrelevant, of course.
The code should also make sure the freelists "agree" with how the
earlier patch mapped the buffers to NUMA nodes, i.e. the freelist should
only contain buffers from the "correct" NUMA node, etc. I haven't paid
much attention to this - I believe it should work for "nice" values of
shared buffers (when it evenly divides between nodes). But I'm sure it's
possible to confuse that (won't cause crashes, but inefficiency).
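
The partitioning heuristic described above could be expressed roughly like
this; an illustrative reconstruction with made-up names, not the code from
the patch:

#define MIN_FREELIST_PARTITIONS 4

/*
 * At least 4 partitions in total, the same number of partitions on every
 * node, and exactly one partition per node once there are enough nodes.
 */
static int
num_freelist_partitions(int numa_nodes)
{
	int			per_node;

	if (numa_nodes <= 0)
		numa_nodes = 1;			/* a non-NUMA system behaves like one node */

	if (numa_nodes >= MIN_FREELIST_PARTITIONS)
		return numa_nodes;		/* one partition per node */

	/* round up, so every node gets the same number of partitions */
	per_node = (MIN_FREELIST_PARTITIONS + numa_nodes - 1) / numa_nodes;

	return per_node * numa_nodes;
}

That matches the examples above: 1 node gives 4 partitions, 2 nodes give 4,
3 nodes give 6, and with more nodes each node simply gets its own partition.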
* There's now a patch partitioning clocksweep, using the same scheme as
the freelists. I came to the conclusion it doesn't make much sense to
partition these things differently - I can't think of a reason why that
would be advantageous, and it makes it easier to reason about.
The clocksweep partitioning is somewhat harder, because it affects
BgBufferSync() and related code. With the partitioning we now have
multiple "clock hands" for different ranges of buffers, and the clock
sweep needs to consider that. I modified BgBufferSync to simply loop
through the ClockSweep partitions, and do a small cleanup for each.
It does work, as in "it doesn't crash". But this part definitely needs
review to make sure I got the changes to the "pacing" right.
* This new freelist/clocksweep partitioning scheme is however harder to
disable. I now realize the GUC may not quite do the trick, and there isn't
even a GUC for the clocksweep. I need to think about this, but I'm not even
sure how feasible it'd be to have two separate GUCs (because of how these
two pieces are intertwined). For now, if you want to test without the
partitioning, you need to skip the patch.
I did some quick perf testing on my old xeon machine (2 NUMA nodes), and
the results are encouraging. For a read-only pgbench (2x shared buffers,
within RAM), I saw an increase from 1.1M tps to 1.3M. Not crazy, but not
bad considering the patch is more about consistency than raw throughput.
For a read-write pgbench I however saw some strange drops/increases of
throughput. I suspect this might be due to some thinko in the clocksweep
partitioning, but I'll need to take a closer look.
regards
--
Tomas Vondra
Attachments:
v2-0007-NUMA-pin-backends-to-NUMA-nodes.patch (text/x-patch)
From ca651eb85a6656c79fee5aaabc99e4b772b1b8fe Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tomas@vondra.me>
Date: Tue, 27 May 2025 23:08:48 +0200
Subject: [PATCH v2 7/7] NUMA: pin backends to NUMA nodes
When initializing the backend, we pick a PGPROC entry from the right
NUMA node where the backend is running. But the process can move to a
different core / node, so to prevent that we pin it.
---
src/backend/storage/lmgr/proc.c | 21 +++++++++++++++++++++
src/backend/utils/init/globals.c | 1 +
src/backend/utils/misc/guc_tables.c | 10 ++++++++++
src/include/miscadmin.h | 1 +
4 files changed, 33 insertions(+)
diff --git a/src/backend/storage/lmgr/proc.c b/src/backend/storage/lmgr/proc.c
index 9d3e94a7b3a..4c9e55608b2 100644
--- a/src/backend/storage/lmgr/proc.c
+++ b/src/backend/storage/lmgr/proc.c
@@ -729,6 +729,27 @@ InitProcess(void)
}
MyProcNumber = GetNumberFromPGProc(MyProc);
+ /*
+ * Optionally, restrict the process to only run on CPUs from the same NUMA
+ * as the PGPROC. We do this even if the PGPROC has a different NUMA node,
+ * but not for PGPROC entries without a node (i.e. aux/2PC entries).
+ *
+ * This also means we only do this with numa_procs_interleave, because
+ * without that we'll have numa_node=-1 for all PGPROC entries.
+ *
+ * FIXME add proper error-checking for libnuma functions
+ */
+ if (numa_procs_pin && MyProc->numa_node != -1)
+ {
+ struct bitmask *cpumask = numa_allocate_cpumask();
+
+ numa_node_to_cpus(MyProc->numa_node, cpumask);
+
+ numa_sched_setaffinity(MyProcPid, cpumask);
+
+ numa_free_cpumask(cpumask);
+ }
+
/*
* Cross-check that the PGPROC is of the type we expect; if this were not
* the case, it would get returned to the wrong list.
diff --git a/src/backend/utils/init/globals.c b/src/backend/utils/init/globals.c
index 6ee4684d1b8..3f88659b49f 100644
--- a/src/backend/utils/init/globals.c
+++ b/src/backend/utils/init/globals.c
@@ -150,6 +150,7 @@ bool numa_buffers_interleave = false;
bool numa_localalloc = false;
bool numa_partition_freelist = false;
bool numa_procs_interleave = false;
+bool numa_procs_pin = false;
/* GUC parameters for vacuum */
int VacuumBufferUsageLimit = 2048;
diff --git a/src/backend/utils/misc/guc_tables.c b/src/backend/utils/misc/guc_tables.c
index 7b718760248..862341e137e 100644
--- a/src/backend/utils/misc/guc_tables.c
+++ b/src/backend/utils/misc/guc_tables.c
@@ -2156,6 +2156,16 @@ struct config_bool ConfigureNamesBool[] =
NULL, NULL, NULL
},
+ {
+ {"numa_procs_pin", PGC_POSTMASTER, DEVELOPER_OPTIONS,
+ gettext_noop("Enables pinning backends to NUMA nodes (matching the PGPROC node)."),
+ gettext_noop("When enabled, sets affinity to CPUs from the same NUMA node."),
+ },
+ &numa_procs_pin,
+ false,
+ NULL, NULL, NULL
+ },
+
{
{"sync_replication_slots", PGC_SIGHUP, REPLICATION_STANDBY,
gettext_noop("Enables a physical standby to synchronize logical failover replication slots from the primary server."),
diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h
index cdeee8dccba..a97741c6707 100644
--- a/src/include/miscadmin.h
+++ b/src/include/miscadmin.h
@@ -182,6 +182,7 @@ extern PGDLLIMPORT bool numa_buffers_interleave;
extern PGDLLIMPORT bool numa_localalloc;
extern PGDLLIMPORT bool numa_partition_freelist;
extern PGDLLIMPORT bool numa_procs_interleave;
+extern PGDLLIMPORT bool numa_procs_pin;
extern PGDLLIMPORT int commit_timestamp_buffers;
extern PGDLLIMPORT int multixact_member_buffers;
--
2.49.0
v2-0006-NUMA-interleave-PGPROC-entries.patch (text/x-patch)
From 0d79d2fb6ab9f1d5b0b3f03e500315135329b09e Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tomas@vondra.me>
Date: Thu, 22 May 2025 18:39:08 +0200
Subject: [PATCH v2 6/7] NUMA: interleave PGPROC entries
The goal is to distribute ProcArray (or rather PGPROC entries and
associated fast-path arrays) to NUMA nodes.
We can't do this by simply interleaving pages, because that wouldn't
work for both parts at the same time. We want to place the PGPROC and
its fast-path locking structs on the same node, but the structs are
of different sizes, etc.
Another problem is that PGPROC entries are fairly small, so with huge
pages and reasonable values of max_connections everything fits onto a
single page. We don't want to make this incompatible with huge pages.
Note: If we eventually switch to allocating separate shared segments for
different parts (to allow on-line resizing), we could keep using regular
pages for procarray, and this would not be such an issue.
To make this work, we split the PGPROC array into per-node segments,
each with about (MaxBackends / numa_nodes) entries, and one segment for
auxiliary processes and prepared transactions. And we do the same thing
for fast-path arrays.
The PGPROC segments are laid out like this (e.g. for 2 NUMA nodes):
- PGPROC array / node #0
- PGPROC array / node #1
- PGPROC array / aux processes + 2PC transactions
- fast-path arrays / node #0
- fast-path arrays / node #1
- fast-path arrays / aux processes + 2PC transaction
Each segment is aligned to (starts at) a memory page boundary, and is
effectively a multiple of the memory page size.
Having a single PGPROC array made certain operations easier - e.g. it
was possible to iterate the array, and GetNumberFromPGProc() could
calculate offset by simply subtracting PGPROC pointers. With multiple
segments that's not possible, but the fallout is minimal.
Most places accessed PGPROC through PROC_HDR->allProcs, and can continue
to do so, except that now they get a pointer to the PGPROC (which most
places wanted anyway).
Note: There's an indirection, though. But the pointer does not change,
so hopefully that's not an issue. And each PGPROC entry gets an explicit
procnumber field, which is the index in allProcs, GetNumberFromPGProc
can simply return that.
Each PGPROC also gets numa_node, tracking the NUMA node, so that we
don't have to recalculate that. This is used by InitProcess() to pick
a PGPROC entry from the local NUMA node.
Note: The scheduler may migrate the process to a different CPU/node
later. Maybe we should consider pinning the process to the node?
---
src/backend/access/transam/clog.c | 4 +-
src/backend/postmaster/pgarch.c | 2 +-
src/backend/postmaster/walsummarizer.c | 2 +-
src/backend/storage/buffer/freelist.c | 2 +-
src/backend/storage/ipc/procarray.c | 61 ++--
src/backend/storage/lmgr/lock.c | 6 +-
src/backend/storage/lmgr/proc.c | 368 +++++++++++++++++++++++--
src/backend/utils/init/globals.c | 1 +
src/backend/utils/misc/guc_tables.c | 10 +
src/include/miscadmin.h | 1 +
src/include/storage/proc.h | 11 +-
11 files changed, 406 insertions(+), 62 deletions(-)
diff --git a/src/backend/access/transam/clog.c b/src/backend/access/transam/clog.c
index e80fbe109cf..928d126d0ee 100644
--- a/src/backend/access/transam/clog.c
+++ b/src/backend/access/transam/clog.c
@@ -574,7 +574,7 @@ TransactionGroupUpdateXidStatus(TransactionId xid, XidStatus status,
/* Walk the list and update the status of all XIDs. */
while (nextidx != INVALID_PROC_NUMBER)
{
- PGPROC *nextproc = &ProcGlobal->allProcs[nextidx];
+ PGPROC *nextproc = ProcGlobal->allProcs[nextidx];
int64 thispageno = nextproc->clogGroupMemberPage;
/*
@@ -633,7 +633,7 @@ TransactionGroupUpdateXidStatus(TransactionId xid, XidStatus status,
*/
while (wakeidx != INVALID_PROC_NUMBER)
{
- PGPROC *wakeproc = &ProcGlobal->allProcs[wakeidx];
+ PGPROC *wakeproc = ProcGlobal->allProcs[wakeidx];
wakeidx = pg_atomic_read_u32(&wakeproc->clogGroupNext);
pg_atomic_write_u32(&wakeproc->clogGroupNext, INVALID_PROC_NUMBER);
diff --git a/src/backend/postmaster/pgarch.c b/src/backend/postmaster/pgarch.c
index 78e39e5f866..e28e0f7d3bd 100644
--- a/src/backend/postmaster/pgarch.c
+++ b/src/backend/postmaster/pgarch.c
@@ -289,7 +289,7 @@ PgArchWakeup(void)
* be relaunched shortly and will start archiving.
*/
if (arch_pgprocno != INVALID_PROC_NUMBER)
- SetLatch(&ProcGlobal->allProcs[arch_pgprocno].procLatch);
+ SetLatch(&ProcGlobal->allProcs[arch_pgprocno]->procLatch);
}
diff --git a/src/backend/postmaster/walsummarizer.c b/src/backend/postmaster/walsummarizer.c
index 777c9a8d555..087279a6a8e 100644
--- a/src/backend/postmaster/walsummarizer.c
+++ b/src/backend/postmaster/walsummarizer.c
@@ -649,7 +649,7 @@ WakeupWalSummarizer(void)
LWLockRelease(WALSummarizerLock);
if (pgprocno != INVALID_PROC_NUMBER)
- SetLatch(&ProcGlobal->allProcs[pgprocno].procLatch);
+ SetLatch(&ProcGlobal->allProcs[pgprocno]->procLatch);
}
/*
diff --git a/src/backend/storage/buffer/freelist.c b/src/backend/storage/buffer/freelist.c
index 1827e052da7..2ce158ca9bd 100644
--- a/src/backend/storage/buffer/freelist.c
+++ b/src/backend/storage/buffer/freelist.c
@@ -446,7 +446,7 @@ StrategyGetBuffer(BufferAccessStrategy strategy, uint32 *buf_state, bool *from_r
* actually fine because procLatch isn't ever freed, so we just can
* potentially set the wrong process' (or no process') latch.
*/
- SetLatch(&ProcGlobal->allProcs[bgwprocno].procLatch);
+ SetLatch(&ProcGlobal->allProcs[bgwprocno]->procLatch);
}
/*
diff --git a/src/backend/storage/ipc/procarray.c b/src/backend/storage/ipc/procarray.c
index 2418967def6..82158eeb5d6 100644
--- a/src/backend/storage/ipc/procarray.c
+++ b/src/backend/storage/ipc/procarray.c
@@ -268,7 +268,7 @@ typedef enum KAXCompressReason
static ProcArrayStruct *procArray;
-static PGPROC *allProcs;
+static PGPROC **allProcs;
/*
* Cache to reduce overhead of repeated calls to TransactionIdIsInProgress()
@@ -502,7 +502,7 @@ ProcArrayAdd(PGPROC *proc)
int this_procno = arrayP->pgprocnos[index];
Assert(this_procno >= 0 && this_procno < (arrayP->maxProcs + NUM_AUXILIARY_PROCS));
- Assert(allProcs[this_procno].pgxactoff == index);
+ Assert(allProcs[this_procno]->pgxactoff == index);
/* If we have found our right position in the array, break */
if (this_procno > pgprocno)
@@ -538,9 +538,9 @@ ProcArrayAdd(PGPROC *proc)
int procno = arrayP->pgprocnos[index];
Assert(procno >= 0 && procno < (arrayP->maxProcs + NUM_AUXILIARY_PROCS));
- Assert(allProcs[procno].pgxactoff == index - 1);
+ Assert(allProcs[procno]->pgxactoff == index - 1);
- allProcs[procno].pgxactoff = index;
+ allProcs[procno]->pgxactoff = index;
}
/*
@@ -581,7 +581,7 @@ ProcArrayRemove(PGPROC *proc, TransactionId latestXid)
myoff = proc->pgxactoff;
Assert(myoff >= 0 && myoff < arrayP->numProcs);
- Assert(ProcGlobal->allProcs[arrayP->pgprocnos[myoff]].pgxactoff == myoff);
+ Assert(ProcGlobal->allProcs[arrayP->pgprocnos[myoff]]->pgxactoff == myoff);
if (TransactionIdIsValid(latestXid))
{
@@ -636,9 +636,9 @@ ProcArrayRemove(PGPROC *proc, TransactionId latestXid)
int procno = arrayP->pgprocnos[index];
Assert(procno >= 0 && procno < (arrayP->maxProcs + NUM_AUXILIARY_PROCS));
- Assert(allProcs[procno].pgxactoff - 1 == index);
+ Assert(allProcs[procno]->pgxactoff - 1 == index);
- allProcs[procno].pgxactoff = index;
+ allProcs[procno]->pgxactoff = index;
}
/*
@@ -860,7 +860,7 @@ ProcArrayGroupClearXid(PGPROC *proc, TransactionId latestXid)
/* Walk the list and clear all XIDs. */
while (nextidx != INVALID_PROC_NUMBER)
{
- PGPROC *nextproc = &allProcs[nextidx];
+ PGPROC *nextproc = allProcs[nextidx];
ProcArrayEndTransactionInternal(nextproc, nextproc->procArrayGroupMemberXid);
@@ -880,7 +880,7 @@ ProcArrayGroupClearXid(PGPROC *proc, TransactionId latestXid)
*/
while (wakeidx != INVALID_PROC_NUMBER)
{
- PGPROC *nextproc = &allProcs[wakeidx];
+ PGPROC *nextproc = allProcs[wakeidx];
wakeidx = pg_atomic_read_u32(&nextproc->procArrayGroupNext);
pg_atomic_write_u32(&nextproc->procArrayGroupNext, INVALID_PROC_NUMBER);
@@ -1526,7 +1526,7 @@ TransactionIdIsInProgress(TransactionId xid)
pxids = other_subxidstates[pgxactoff].count;
pg_read_barrier(); /* pairs with barrier in GetNewTransactionId() */
pgprocno = arrayP->pgprocnos[pgxactoff];
- proc = &allProcs[pgprocno];
+ proc = allProcs[pgprocno];
for (j = pxids - 1; j >= 0; j--)
{
/* Fetch xid just once - see GetNewTransactionId */
@@ -1622,7 +1622,6 @@ TransactionIdIsInProgress(TransactionId xid)
return false;
}
-
/*
* Determine XID horizons.
*
@@ -1740,7 +1739,7 @@ ComputeXidHorizons(ComputeXidHorizonsResult *h)
for (int index = 0; index < arrayP->numProcs; index++)
{
int pgprocno = arrayP->pgprocnos[index];
- PGPROC *proc = &allProcs[pgprocno];
+ PGPROC *proc = allProcs[pgprocno];
int8 statusFlags = ProcGlobal->statusFlags[index];
TransactionId xid;
TransactionId xmin;
@@ -2224,7 +2223,7 @@ GetSnapshotData(Snapshot snapshot)
TransactionId xid = UINT32_ACCESS_ONCE(other_xids[pgxactoff]);
uint8 statusFlags;
- Assert(allProcs[arrayP->pgprocnos[pgxactoff]].pgxactoff == pgxactoff);
+ Assert(allProcs[arrayP->pgprocnos[pgxactoff]]->pgxactoff == pgxactoff);
/*
* If the transaction has no XID assigned, we can skip it; it
@@ -2298,7 +2297,7 @@ GetSnapshotData(Snapshot snapshot)
if (nsubxids > 0)
{
int pgprocno = pgprocnos[pgxactoff];
- PGPROC *proc = &allProcs[pgprocno];
+ PGPROC *proc = allProcs[pgprocno];
pg_read_barrier(); /* pairs with GetNewTransactionId */
@@ -2499,7 +2498,7 @@ ProcArrayInstallImportedXmin(TransactionId xmin,
for (index = 0; index < arrayP->numProcs; index++)
{
int pgprocno = arrayP->pgprocnos[index];
- PGPROC *proc = &allProcs[pgprocno];
+ PGPROC *proc = allProcs[pgprocno];
int statusFlags = ProcGlobal->statusFlags[index];
TransactionId xid;
@@ -2725,7 +2724,7 @@ GetRunningTransactionData(void)
if (TransactionIdPrecedes(xid, oldestDatabaseRunningXid))
{
int pgprocno = arrayP->pgprocnos[index];
- PGPROC *proc = &allProcs[pgprocno];
+ PGPROC *proc = allProcs[pgprocno];
if (proc->databaseId == MyDatabaseId)
oldestDatabaseRunningXid = xid;
@@ -2756,7 +2755,7 @@ GetRunningTransactionData(void)
for (index = 0; index < arrayP->numProcs; index++)
{
int pgprocno = arrayP->pgprocnos[index];
- PGPROC *proc = &allProcs[pgprocno];
+ PGPROC *proc = allProcs[pgprocno];
int nsubxids;
/*
@@ -3006,7 +3005,7 @@ GetVirtualXIDsDelayingChkpt(int *nvxids, int type)
for (index = 0; index < arrayP->numProcs; index++)
{
int pgprocno = arrayP->pgprocnos[index];
- PGPROC *proc = &allProcs[pgprocno];
+ PGPROC *proc = allProcs[pgprocno];
if ((proc->delayChkptFlags & type) != 0)
{
@@ -3047,7 +3046,7 @@ HaveVirtualXIDsDelayingChkpt(VirtualTransactionId *vxids, int nvxids, int type)
for (index = 0; index < arrayP->numProcs; index++)
{
int pgprocno = arrayP->pgprocnos[index];
- PGPROC *proc = &allProcs[pgprocno];
+ PGPROC *proc = allProcs[pgprocno];
VirtualTransactionId vxid;
GET_VXID_FROM_PGPROC(vxid, *proc);
@@ -3175,7 +3174,7 @@ BackendPidGetProcWithLock(int pid)
for (index = 0; index < arrayP->numProcs; index++)
{
- PGPROC *proc = &allProcs[arrayP->pgprocnos[index]];
+ PGPROC *proc = allProcs[arrayP->pgprocnos[index]];
if (proc->pid == pid)
{
@@ -3218,7 +3217,7 @@ BackendXidGetPid(TransactionId xid)
if (other_xids[index] == xid)
{
int pgprocno = arrayP->pgprocnos[index];
- PGPROC *proc = &allProcs[pgprocno];
+ PGPROC *proc = allProcs[pgprocno];
result = proc->pid;
break;
@@ -3287,7 +3286,7 @@ GetCurrentVirtualXIDs(TransactionId limitXmin, bool excludeXmin0,
for (index = 0; index < arrayP->numProcs; index++)
{
int pgprocno = arrayP->pgprocnos[index];
- PGPROC *proc = &allProcs[pgprocno];
+ PGPROC *proc = allProcs[pgprocno];
uint8 statusFlags = ProcGlobal->statusFlags[index];
if (proc == MyProc)
@@ -3389,7 +3388,7 @@ GetConflictingVirtualXIDs(TransactionId limitXmin, Oid dbOid)
for (index = 0; index < arrayP->numProcs; index++)
{
int pgprocno = arrayP->pgprocnos[index];
- PGPROC *proc = &allProcs[pgprocno];
+ PGPROC *proc = allProcs[pgprocno];
/* Exclude prepared transactions */
if (proc->pid == 0)
@@ -3454,7 +3453,7 @@ SignalVirtualTransaction(VirtualTransactionId vxid, ProcSignalReason sigmode,
for (index = 0; index < arrayP->numProcs; index++)
{
int pgprocno = arrayP->pgprocnos[index];
- PGPROC *proc = &allProcs[pgprocno];
+ PGPROC *proc = allProcs[pgprocno];
VirtualTransactionId procvxid;
GET_VXID_FROM_PGPROC(procvxid, *proc);
@@ -3509,7 +3508,7 @@ MinimumActiveBackends(int min)
for (index = 0; index < arrayP->numProcs; index++)
{
int pgprocno = arrayP->pgprocnos[index];
- PGPROC *proc = &allProcs[pgprocno];
+ PGPROC *proc = allProcs[pgprocno];
/*
* Since we're not holding a lock, need to be prepared to deal with
@@ -3555,7 +3554,7 @@ CountDBBackends(Oid databaseid)
for (index = 0; index < arrayP->numProcs; index++)
{
int pgprocno = arrayP->pgprocnos[index];
- PGPROC *proc = &allProcs[pgprocno];
+ PGPROC *proc = allProcs[pgprocno];
if (proc->pid == 0)
continue; /* do not count prepared xacts */
@@ -3584,7 +3583,7 @@ CountDBConnections(Oid databaseid)
for (index = 0; index < arrayP->numProcs; index++)
{
int pgprocno = arrayP->pgprocnos[index];
- PGPROC *proc = &allProcs[pgprocno];
+ PGPROC *proc = allProcs[pgprocno];
if (proc->pid == 0)
continue; /* do not count prepared xacts */
@@ -3615,7 +3614,7 @@ CancelDBBackends(Oid databaseid, ProcSignalReason sigmode, bool conflictPending)
for (index = 0; index < arrayP->numProcs; index++)
{
int pgprocno = arrayP->pgprocnos[index];
- PGPROC *proc = &allProcs[pgprocno];
+ PGPROC *proc = allProcs[pgprocno];
if (databaseid == InvalidOid || proc->databaseId == databaseid)
{
@@ -3656,7 +3655,7 @@ CountUserBackends(Oid roleid)
for (index = 0; index < arrayP->numProcs; index++)
{
int pgprocno = arrayP->pgprocnos[index];
- PGPROC *proc = &allProcs[pgprocno];
+ PGPROC *proc = allProcs[pgprocno];
if (proc->pid == 0)
continue; /* do not count prepared xacts */
@@ -3719,7 +3718,7 @@ CountOtherDBBackends(Oid databaseId, int *nbackends, int *nprepared)
for (index = 0; index < arrayP->numProcs; index++)
{
int pgprocno = arrayP->pgprocnos[index];
- PGPROC *proc = &allProcs[pgprocno];
+ PGPROC *proc = allProcs[pgprocno];
uint8 statusFlags = ProcGlobal->statusFlags[index];
if (proc->databaseId != databaseId)
@@ -3785,7 +3784,7 @@ TerminateOtherDBBackends(Oid databaseId)
for (i = 0; i < procArray->numProcs; i++)
{
int pgprocno = arrayP->pgprocnos[i];
- PGPROC *proc = &allProcs[pgprocno];
+ PGPROC *proc = allProcs[pgprocno];
if (proc->databaseId != databaseId)
continue;
diff --git a/src/backend/storage/lmgr/lock.c b/src/backend/storage/lmgr/lock.c
index 62f3471448e..c84a2a5f1bc 100644
--- a/src/backend/storage/lmgr/lock.c
+++ b/src/backend/storage/lmgr/lock.c
@@ -2844,7 +2844,7 @@ FastPathTransferRelationLocks(LockMethod lockMethodTable, const LOCKTAG *locktag
*/
for (i = 0; i < ProcGlobal->allProcCount; i++)
{
- PGPROC *proc = &ProcGlobal->allProcs[i];
+ PGPROC *proc = ProcGlobal->allProcs[i];
uint32 j;
LWLockAcquire(&proc->fpInfoLock, LW_EXCLUSIVE);
@@ -3103,7 +3103,7 @@ GetLockConflicts(const LOCKTAG *locktag, LOCKMODE lockmode, int *countp)
*/
for (i = 0; i < ProcGlobal->allProcCount; i++)
{
- PGPROC *proc = &ProcGlobal->allProcs[i];
+ PGPROC *proc = ProcGlobal->allProcs[i];
uint32 j;
/* A backend never blocks itself */
@@ -3790,7 +3790,7 @@ GetLockStatusData(void)
*/
for (i = 0; i < ProcGlobal->allProcCount; ++i)
{
- PGPROC *proc = &ProcGlobal->allProcs[i];
+ PGPROC *proc = ProcGlobal->allProcs[i];
/* Skip backends with pid=0, as they don't hold fast-path locks */
if (proc->pid == 0)
diff --git a/src/backend/storage/lmgr/proc.c b/src/backend/storage/lmgr/proc.c
index e9ef0fbfe32..9d3e94a7b3a 100644
--- a/src/backend/storage/lmgr/proc.c
+++ b/src/backend/storage/lmgr/proc.c
@@ -29,21 +29,29 @@
*/
#include "postgres.h"
+#include <sched.h>
#include <signal.h>
#include <unistd.h>
#include <sys/time.h>
+#ifdef USE_LIBNUMA
+#include <numa.h>
+#include <numaif.h>
+#endif
+
#include "access/transam.h"
#include "access/twophase.h"
#include "access/xlogutils.h"
#include "miscadmin.h"
#include "pgstat.h"
+#include "port/pg_numa.h"
#include "postmaster/autovacuum.h"
#include "replication/slotsync.h"
#include "replication/syncrep.h"
#include "storage/condition_variable.h"
#include "storage/ipc.h"
#include "storage/lmgr.h"
+#include "storage/pg_shmem.h"
#include "storage/pmsignal.h"
#include "storage/proc.h"
#include "storage/procarray.h"
@@ -89,6 +97,12 @@ static void ProcKill(int code, Datum arg);
static void AuxiliaryProcKill(int code, Datum arg);
static void CheckDeadLock(void);
+/* NUMA */
+static Size get_memory_page_size(void); /* XXX duplicate */
+static void move_to_node(char *startptr, char *endptr,
+ Size mem_page_size, int node);
+static int numa_nodes = -1;
+
/*
* Report shared-memory space needed by PGPROC.
@@ -100,11 +114,40 @@ PGProcShmemSize(void)
Size TotalProcs =
add_size(MaxBackends, add_size(NUM_AUXILIARY_PROCS, max_prepared_xacts));
+ size = add_size(size, mul_size(TotalProcs, sizeof(PGPROC *)));
size = add_size(size, mul_size(TotalProcs, sizeof(PGPROC)));
size = add_size(size, mul_size(TotalProcs, sizeof(*ProcGlobal->xids)));
size = add_size(size, mul_size(TotalProcs, sizeof(*ProcGlobal->subxidStates)));
size = add_size(size, mul_size(TotalProcs, sizeof(*ProcGlobal->statusFlags)));
+ /*
+ * With NUMA, we allocate the PGPROC array in several chunks. With shared
+ * buffers we simply manually assign parts of the buffer array to
+ * different NUMA nodes, and that does the trick. But we can't do that for
+ * PGPROC, as the number of PGPROC entries is much lower, especially with
+ * huge pages. We can fit ~2k entries on a 2MB page, and NUMA does stuff
+ * with page granularity, and the large NUMA systems are likely to use
+ * huge pages. So with sensible max_connections we would not use more than
+ * a single page, which means it gets to a single NUMA node.
+ *
+ * So we allocate PGPROC not as a single array, but one array per NUMA
+ * node, and then one array for aux processes (without NUMA node
+ * assigned). Each array may need up to memory-page-worth of padding,
+ * worst case. So we just add that - it's a bit wasteful, but good enough
+ * for PoC.
+ *
+ * FIXME Should be conditional, but that was causing problems in bootstrap
+ * mode. Or maybe it was because the code that allocates stuff later does
+ * not do that conditionally. Anyway, needs to be fixed.
+ */
+ /* if (numa_procs_interleave) */
+ {
+ int num_nodes = numa_num_configured_nodes();
+ Size mem_page_size = get_memory_page_size();
+
+ size = add_size(size, mul_size((num_nodes + 1), mem_page_size));
+ }
+
return size;
}
@@ -129,6 +172,26 @@ FastPathLockShmemSize(void)
size = add_size(size, mul_size(TotalProcs, (fpLockBitsSize + fpRelIdSize)));
+ /*
+ * Same NUMA-padding logic as in PGProcShmemSize, adding a memory page per
+ * NUMA node - but this way we add two pages per node - one for PGPROC,
+ * one for fast-path arrays. In theory we could make this work just one
+ * page per node, by adding fast-path arrays right after PGPROC entries on
+ * each node. But now we allocate fast-path locks separately - good enough
+ * for PoC.
+ *
+ * FIXME Should be conditional, but that was causing problems in bootstrap
+ * mode. Or maybe it was because the code that allocates stuff later does
+ * not do that conditionally. Anyway, needs to be fixed.
+ */
+ /* if (numa_procs_interleave) */
+ {
+ int num_nodes = numa_num_configured_nodes();
+ Size mem_page_size = get_memory_page_size();
+
+ size = add_size(size, mul_size((num_nodes + 1), mem_page_size));
+ }
+
return size;
}
@@ -191,11 +254,13 @@ ProcGlobalSemas(void)
void
InitProcGlobal(void)
{
- PGPROC *procs;
+ PGPROC **procs;
int i,
j;
bool found;
uint32 TotalProcs = MaxBackends + NUM_AUXILIARY_PROCS + max_prepared_xacts;
+ int procs_total;
+ int procs_per_node;
/* Used for setup of per-backend fast-path slots. */
char *fpPtr,
@@ -205,6 +270,8 @@ InitProcGlobal(void)
Size requestSize;
char *ptr;
+ Size mem_page_size = get_memory_page_size();
+
/* Create the ProcGlobal shared structure */
ProcGlobal = (PROC_HDR *)
ShmemInitStruct("Proc Header", sizeof(PROC_HDR), &found);
@@ -224,6 +291,9 @@ InitProcGlobal(void)
pg_atomic_init_u32(&ProcGlobal->procArrayGroupFirst, INVALID_PROC_NUMBER);
pg_atomic_init_u32(&ProcGlobal->clogGroupFirst, INVALID_PROC_NUMBER);
+ /* one chunk per NUMA node (without NUMA assume 1 node) */
+ numa_nodes = numa_num_configured_nodes();
+
/*
* Create and initialize all the PGPROC structures we'll need. There are
* six separate consumers: (1) normal backends, (2) autovacuum workers and
@@ -241,19 +311,108 @@ InitProcGlobal(void)
MemSet(ptr, 0, requestSize);
- procs = (PGPROC *) ptr;
- ptr = (char *) ptr + TotalProcs * sizeof(PGPROC);
+ /* allprocs (array of pointers to PGPROC entries) */
+ procs = (PGPROC **) ptr;
+ ptr = (char *) ptr + TotalProcs * sizeof(PGPROC *);
ProcGlobal->allProcs = procs;
/* XXX allProcCount isn't really all of them; it excludes prepared xacts */
ProcGlobal->allProcCount = MaxBackends + NUM_AUXILIARY_PROCS;
+ /*
+ * NUMA partitioning
+ *
+ * Now build the actual PGPROC arrays, one "chunk" per NUMA node (and one
+ * extra for auxiliary processes and 2PC transactions, not associated with
+ * any particular node).
+ *
+ * First determine how many "backend" procs to allocate per NUMA node. The
+ * count may not be exactly divisible, but we mostly ignore that. The last
+ * node may get somewhat fewer PGPROC entries, but the imbalance ought to
+ * be pretty small (if MaxBackends >> numa_nodes).
+ *
+ * XXX A fairer distribution is possible, but not worth it now.
+ */
+ procs_per_node = (MaxBackends + (numa_nodes - 1)) / numa_nodes;
+ procs_total = 0;
+
+ /* build PGPROC entries for NUMA nodes */
+ for (i = 0; i < numa_nodes; i++)
+ {
+ PGPROC *procs_node;
+
+ /* the last NUMA node may get fewer PGPROC entries, but meh */
+ int count_node = Min(procs_per_node, MaxBackends - procs_total);
+
+ /* make sure to align the PGPROC array to memory page */
+ ptr = (char *) TYPEALIGN(mem_page_size, ptr);
+
+ /* allocate the PGPROC chunk for this node */
+ procs_node = (PGPROC *) ptr;
+ ptr = (char *) ptr + count_node * sizeof(PGPROC);
+
+ /* don't overflow the allocation */
+ Assert((ptr > (char *) procs) && (ptr <= (char *) procs + requestSize));
+
+ /* add pointers to the PGPROC entries to allProcs */
+ for (j = 0; j < count_node; j++)
+ {
+ procs_node[j].numa_node = i;
+ procs_node[j].procnumber = procs_total;
+
+ ProcGlobal->allProcs[procs_total++] = &procs_node[j];
+ }
+
+ move_to_node((char *) procs_node, ptr, mem_page_size, i);
+ }
+
+ /*
+ * also build PGPROC entries for auxiliary procs / prepared xacts (we
+ * don't assign those to any NUMA node)
+ *
+ * XXX Mostly duplicate of preceding block, could be reused.
+ */
+ {
+ PGPROC *procs_node;
+ int count_node = (NUM_AUXILIARY_PROCS + max_prepared_xacts);
+
+ /*
+ * Make sure to align PGPROC array to memory page (it may not be
+ * aligned). We won't assign this to any NUMA node, but we still don't
+ * want it to interfere with the preceding chunk (for the last NUMA
+ * node).
+ */
+ ptr = (char *) TYPEALIGN(mem_page_size, ptr);
+
+ procs_node = (PGPROC *) ptr;
+ ptr = (char *) ptr + count_node * sizeof(PGPROC);
+
+ /* don't overflow the allocation */
+ Assert((ptr > (char *) procs) && (ptr <= (char *) procs + requestSize));
+
+ /* now add the PGPROC pointers to allProcs */
+ for (j = 0; j < count_node; j++)
+ {
+ procs_node[j].numa_node = -1;
+ procs_node[j].procnumber = procs_total;
+
+ ProcGlobal->allProcs[procs_total++] = &procs_node[j];
+ }
+ }
+
+ /* we should have allocated the expected number of PGPROC entries */
+ Assert(procs_total == TotalProcs);
+
/*
* Allocate arrays mirroring PGPROC fields in a dense manner. See
* PROC_HDR.
*
* XXX: It might make sense to increase padding for these arrays, given
* how hotly they are accessed.
+ *
+ * XXX Would it make sense to NUMA-partition these chunks too, somehow?
+ * But those arrays are tiny, fit into a single memory page, so would need
+ * to be made more complex. Not sure.
*/
ProcGlobal->xids = (TransactionId *) ptr;
ptr = (char *) ptr + (TotalProcs * sizeof(*ProcGlobal->xids));
@@ -286,23 +445,100 @@ InitProcGlobal(void)
/* For asserts checking we did not overflow. */
fpEndPtr = fpPtr + requestSize;
- for (i = 0; i < TotalProcs; i++)
+ /* reset the count */
+ procs_total = 0;
+
+ /*
+ * Mimic the same logic as above, but for fast-path locking.
+ */
+ for (i = 0; i < numa_nodes; i++)
{
- PGPROC *proc = &procs[i];
+ char *startptr;
+ char *endptr;
- /* Common initialization for all PGPROCs, regardless of type. */
+ /* the last NUMA node may get fewer PGPROC entries, but meh */
+ int procs_node = Min(procs_per_node, MaxBackends - procs_total);
+
+ /* align to memory page, to make move_pages possible */
+ fpPtr = (char *) TYPEALIGN(mem_page_size, fpPtr);
+
+ startptr = fpPtr;
+ endptr = fpPtr + procs_node * (fpLockBitsSize + fpRelIdSize);
+
+ move_to_node(startptr, endptr, mem_page_size, i);
/*
- * Set the fast-path lock arrays, and move the pointer. We interleave
- * the two arrays, to (hopefully) get some locality for each backend.
+ * Now point the PGPROC entries to the fast-path arrays, and also
+ * advance the fpPtr.
*/
- proc->fpLockBits = (uint64 *) fpPtr;
- fpPtr += fpLockBitsSize;
+ for (j = 0; j < procs_node; j++)
+ {
+ PGPROC *proc = ProcGlobal->allProcs[procs_total++];
+
+ /* cross-check we got the expected NUMA node */
+ Assert(proc->numa_node == i);
+ Assert(proc->procnumber == (procs_total - 1));
+
+ /*
+ * Set the fast-path lock arrays, and move the pointer. We
+ * interleave the two arrays, to (hopefully) get some locality for
+ * each backend.
+ */
+ proc->fpLockBits = (uint64 *) fpPtr;
+ fpPtr += fpLockBitsSize;
- proc->fpRelId = (Oid *) fpPtr;
- fpPtr += fpRelIdSize;
+ proc->fpRelId = (Oid *) fpPtr;
+ fpPtr += fpRelIdSize;
- Assert(fpPtr <= fpEndPtr);
+ Assert(fpPtr <= fpEndPtr);
+ }
+
+ Assert(fpPtr == endptr);
+ }
+
+ /* auxiliary processes / prepared xacts */
+ {
+ /* the last NUMA node may get fewer PGPROC entries, but meh */
+ int procs_node = (NUM_AUXILIARY_PROCS + max_prepared_xacts);
+
+ /* align to memory page, to make move_pages possible */
+ fpPtr = (char *) TYPEALIGN(mem_page_size, fpPtr);
+
+ /* now point the PGPROC entries to the fast-path arrays */
+ for (j = 0; j < procs_node; j++)
+ {
+ PGPROC *proc = ProcGlobal->allProcs[procs_total++];
+
+ /* cross-check we got PGPROC with no NUMA node assigned */
+ Assert(proc->numa_node == -1);
+ Assert(proc->procnumber == (procs_total - 1));
+
+ /*
+ * Set the fast-path lock arrays, and move the pointer. We
+ * interleave the two arrays, to (hopefully) get some locality for
+ * each backend.
+ */
+ proc->fpLockBits = (uint64 *) fpPtr;
+ fpPtr += fpLockBitsSize;
+
+ proc->fpRelId = (Oid *) fpPtr;
+ fpPtr += fpRelIdSize;
+
+ Assert(fpPtr <= fpEndPtr);
+ }
+ }
+
+ /* Should have consumed exactly the expected amount of fast-path memory. */
+ Assert(fpPtr <= fpEndPtr);
+
+ /* make sure we allocated the expected number of PGPROC entries */
+ Assert(procs_total == TotalProcs);
+
+ for (i = 0; i < TotalProcs; i++)
+ {
+ PGPROC *proc = procs[i];
+
+ Assert(proc->procnumber == i);
/*
* Set up per-PGPROC semaphore, latch, and fpInfoLock. Prepared xact
@@ -366,15 +602,12 @@ InitProcGlobal(void)
pg_atomic_init_u64(&(proc->waitStart), 0);
}
- /* Should have consumed exactly the expected amount of fast-path memory. */
- Assert(fpPtr == fpEndPtr);
-
/*
* Save pointers to the blocks of PGPROC structures reserved for auxiliary
* processes and prepared transactions.
*/
- AuxiliaryProcs = &procs[MaxBackends];
- PreparedXactProcs = &procs[MaxBackends + NUM_AUXILIARY_PROCS];
+ AuxiliaryProcs = procs[MaxBackends];
+ PreparedXactProcs = procs[MaxBackends + NUM_AUXILIARY_PROCS];
/* Create ProcStructLock spinlock, too */
ProcStructLock = (slock_t *) ShmemInitStruct("ProcStructLock spinlock",
@@ -435,7 +668,45 @@ InitProcess(void)
if (!dlist_is_empty(procgloballist))
{
- MyProc = dlist_container(PGPROC, links, dlist_pop_head_node(procgloballist));
+ /*
+ * With numa interleaving of PGPROC, try to get a PROC entry from the
+ * right NUMA node (when the process starts).
+ *
+ * XXX The process may move to a different NUMA node later, but
+ * there's not much we can do about that.
+ */
+ if (numa_procs_interleave)
+ {
+ dlist_mutable_iter iter;
+ unsigned cpu;
+ unsigned node;
+ int rc;
+
+ rc = getcpu(&cpu, &node);
+ if (rc != 0)
+ elog(ERROR, "getcpu failed: %m");
+
+ MyProc = NULL;
+
+ dlist_foreach_modify(iter, procgloballist)
+ {
+ PGPROC *proc;
+
+ proc = dlist_container(PGPROC, links, iter.cur);
+
+ if (proc->numa_node == node)
+ {
+ MyProc = proc;
+ dlist_delete(iter.cur);
+ break;
+ }
+ }
+ }
+
+ /* didn't find PGPROC from the correct NUMA node, pick any free one */
+ if (MyProc == NULL)
+ MyProc = dlist_container(PGPROC, links, dlist_pop_head_node(procgloballist));
+
SpinLockRelease(ProcStructLock);
}
else
@@ -1988,7 +2259,7 @@ ProcSendSignal(ProcNumber procNumber)
if (procNumber < 0 || procNumber >= ProcGlobal->allProcCount)
elog(ERROR, "procNumber out of range");
- SetLatch(&ProcGlobal->allProcs[procNumber].procLatch);
+ SetLatch(&ProcGlobal->allProcs[procNumber]->procLatch);
}
/*
@@ -2063,3 +2334,60 @@ BecomeLockGroupMember(PGPROC *leader, int pid)
return ok;
}
+
+/* copy from buf_init.c */
+static Size
+get_memory_page_size(void)
+{
+ Size os_page_size;
+ Size huge_page_size;
+
+#ifdef WIN32
+ SYSTEM_INFO sysinfo;
+
+ GetSystemInfo(&sysinfo);
+ os_page_size = sysinfo.dwPageSize;
+#else
+ os_page_size = sysconf(_SC_PAGESIZE);
+#endif
+
+ /*
+ * XXX This is a bit annoying/confusing, because we may get a different
+ * result depending on when we call it. Before mmap() we don't know if the
+ * huge pages get used, so we assume they will. And then if we don't get
+ * huge pages, we'll waste memory etc.
+ */
+
+ /* assume huge pages get used, unless HUGE_PAGES_OFF */
+ if (huge_pages_status == HUGE_PAGES_OFF)
+ huge_page_size = 0;
+ else
+ GetHugePageSize(&huge_page_size, NULL);
+
+ return Max(os_page_size, huge_page_size);
+}
+
+/*
+ * move_to_node
+ * move all pages in the given range to the requested NUMA node
+ *
+ * XXX This is expected to only process fairly small number of pages, so no
+ * need to do batching etc. Just move pages one by one.
+ */
+static void
+move_to_node(char *startptr, char *endptr, Size mem_page_size, int node)
+{
+ while (startptr < endptr)
+ {
+ int r,
+ status;
+
+ r = numa_move_pages(0, 1, (void **) &startptr, &node, &status, 0);
+
+ if (r != 0)
+ elog(WARNING, "failed to move page to NUMA node %d (r = %d, status = %d)",
+ node, r, status);
+
+ startptr += mem_page_size;
+ }
+}
diff --git a/src/backend/utils/init/globals.c b/src/backend/utils/init/globals.c
index a11bc71a386..6ee4684d1b8 100644
--- a/src/backend/utils/init/globals.c
+++ b/src/backend/utils/init/globals.c
@@ -149,6 +149,7 @@ int MaxBackends = 0;
bool numa_buffers_interleave = false;
bool numa_localalloc = false;
bool numa_partition_freelist = false;
+bool numa_procs_interleave = false;
/* GUC parameters for vacuum */
int VacuumBufferUsageLimit = 2048;
diff --git a/src/backend/utils/misc/guc_tables.c b/src/backend/utils/misc/guc_tables.c
index 0552ed62cc7..7b718760248 100644
--- a/src/backend/utils/misc/guc_tables.c
+++ b/src/backend/utils/misc/guc_tables.c
@@ -2146,6 +2146,16 @@ struct config_bool ConfigureNamesBool[] =
NULL, NULL, NULL
},
+ {
+ {"numa_procs_interleave", PGC_POSTMASTER, DEVELOPER_OPTIONS,
+ gettext_noop("Enables NUMA interleaving of PGPROC entries."),
+ gettext_noop("When enabled, the PGPROC entries are interleaved to all NUMA nodes."),
+ },
+ &numa_procs_interleave,
+ false,
+ NULL, NULL, NULL
+ },
+
{
{"sync_replication_slots", PGC_SIGHUP, REPLICATION_STANDBY,
gettext_noop("Enables a physical standby to synchronize logical failover replication slots from the primary server."),
diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h
index 66baf2bf33e..cdeee8dccba 100644
--- a/src/include/miscadmin.h
+++ b/src/include/miscadmin.h
@@ -181,6 +181,7 @@ extern PGDLLIMPORT int max_parallel_workers;
extern PGDLLIMPORT bool numa_buffers_interleave;
extern PGDLLIMPORT bool numa_localalloc;
extern PGDLLIMPORT bool numa_partition_freelist;
+extern PGDLLIMPORT bool numa_procs_interleave;
extern PGDLLIMPORT int commit_timestamp_buffers;
extern PGDLLIMPORT int multixact_member_buffers;
diff --git a/src/include/storage/proc.h b/src/include/storage/proc.h
index 9f9b3fcfbf1..5cb1632718e 100644
--- a/src/include/storage/proc.h
+++ b/src/include/storage/proc.h
@@ -194,6 +194,8 @@ struct PGPROC
* vacuum must not remove tuples deleted by
* xid >= xmin ! */
+ int procnumber; /* index in ProcGlobal->allProcs */
+
int pid; /* Backend's process ID; 0 if prepared xact */
int pgxactoff; /* offset into various ProcGlobal->arrays with
@@ -319,6 +321,9 @@ struct PGPROC
PGPROC *lockGroupLeader; /* lock group leader, if I'm a member */
dlist_head lockGroupMembers; /* list of members, if I'm a leader */
dlist_node lockGroupLink; /* my member link, if I'm a member */
+
+ /* NUMA node */
+ int numa_node;
};
/* NOTE: "typedef struct PGPROC PGPROC" appears in storage/lock.h. */
@@ -383,7 +388,7 @@ extern PGDLLIMPORT PGPROC *MyProc;
typedef struct PROC_HDR
{
/* Array of PGPROC structures (not including dummies for prepared txns) */
- PGPROC *allProcs;
+ PGPROC **allProcs;
/* Array mirroring PGPROC.xid for each PGPROC currently in the procarray */
TransactionId *xids;
@@ -435,8 +440,8 @@ extern PGDLLIMPORT PGPROC *PreparedXactProcs;
/*
* Accessors for getting PGPROC given a ProcNumber and vice versa.
*/
-#define GetPGProcByNumber(n) (&ProcGlobal->allProcs[(n)])
-#define GetNumberFromPGProc(proc) ((proc) - &ProcGlobal->allProcs[0])
+#define GetPGProcByNumber(n) (ProcGlobal->allProcs[(n)])
+#define GetNumberFromPGProc(proc) ((proc)->procnumber)
/*
* We set aside some extra PGPROC structures for "special worker" processes,
--
2.49.0
v2-0005-NUMA-clockweep-partitioning.patch (text/x-patch)
From c4d51ab87b92f9900e37d42cf74980e87b648a56 Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tomas@vondra.me>
Date: Sun, 8 Jun 2025 18:53:12 +0200
Subject: [PATCH v2 5/7] NUMA: clockweep partitioning
---
src/backend/storage/buffer/bufmgr.c | 473 ++++++++++++++------------
src/backend/storage/buffer/freelist.c | 202 ++++++++---
src/include/storage/buf_internals.h | 4 +-
3 files changed, 424 insertions(+), 255 deletions(-)
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 5922689fe5d..3d6c834d77c 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -3587,6 +3587,23 @@ BufferSync(int flags)
TRACE_POSTGRESQL_BUFFER_SYNC_DONE(NBuffers, num_written, num_to_scan);
}
+/*
+ * Information saved between calls so we can determine the strategy
+ * point's advance rate and avoid scanning already-cleaned buffers.
+ *
+ * XXX One value per partition. We don't know how many partitions are
+ * there, so allocate 32, should be enough for the PoC patch.
+ *
+ * XXX might be better to have a per-partition struct with all the info
+ */
+#define MAX_CLOCKSWEEP_PARTITIONS 32
+static bool saved_info_valid = false;
+static int prev_strategy_buf_id[MAX_CLOCKSWEEP_PARTITIONS];
+static uint32 prev_strategy_passes[MAX_CLOCKSWEEP_PARTITIONS];
+static int next_to_clean[MAX_CLOCKSWEEP_PARTITIONS];
+static uint32 next_passes[MAX_CLOCKSWEEP_PARTITIONS];
+
+
/*
* BgBufferSync -- Write out some dirty buffers in the pool.
*
@@ -3602,55 +3619,24 @@ bool
BgBufferSync(WritebackContext *wb_context)
{
/* info obtained from freelist.c */
- int strategy_buf_id;
- uint32 strategy_passes;
uint32 recent_alloc;
+ uint32 recent_alloc_partition;
+ int num_partitions;
- /*
- * Information saved between calls so we can determine the strategy
- * point's advance rate and avoid scanning already-cleaned buffers.
- */
- static bool saved_info_valid = false;
- static int prev_strategy_buf_id;
- static uint32 prev_strategy_passes;
- static int next_to_clean;
- static uint32 next_passes;
-
- /* Moving averages of allocation rate and clean-buffer density */
- static float smoothed_alloc = 0;
- static float smoothed_density = 10.0;
-
- /* Potentially these could be tunables, but for now, not */
- float smoothing_samples = 16;
- float scan_whole_pool_milliseconds = 120000.0;
-
- /* Used to compute how far we scan ahead */
- long strategy_delta;
- int bufs_to_lap;
- int bufs_ahead;
- float scans_per_alloc;
- int reusable_buffers_est;
- int upcoming_alloc_est;
- int min_scan_buffers;
-
- /* Variables for the scanning loop proper */
- int num_to_scan;
- int num_written;
- int reusable_buffers;
+ /* assume we can hibernate, any partition can set to false */
+ bool hibernate = true;
- /* Variables for final smoothed_density update */
- long new_strategy_delta;
- uint32 new_recent_alloc;
+ /* get the number of clocksweep partitions, and total alloc count */
+ StrategySyncPrepare(&num_partitions, &recent_alloc);
- /*
- * Find out where the freelist clock sweep currently is, and how many
- * buffer allocations have happened since our last call.
- */
- strategy_buf_id = StrategySyncStart(&strategy_passes, &recent_alloc);
+ Assert(num_partitions <= MAX_CLOCKSWEEP_PARTITIONS);
/* Report buffer alloc counts to pgstat */
PendingBgWriterStats.buf_alloc += recent_alloc;
+ /* average alloc buffers per partition */
+ recent_alloc_partition = (recent_alloc / num_partitions);
+
/*
* If we're not running the LRU scan, just stop after doing the stats
* stuff. We mark the saved state invalid so that we can recover sanely
@@ -3663,223 +3649,282 @@ BgBufferSync(WritebackContext *wb_context)
}
/*
- * Compute strategy_delta = how many buffers have been scanned by the
- * clock sweep since last time. If first time through, assume none. Then
- * see if we are still ahead of the clock sweep, and if so, how many
- * buffers we could scan before we'd catch up with it and "lap" it. Note:
- * weird-looking coding of xxx_passes comparisons are to avoid bogus
- * behavior when the passes counts wrap around.
- */
- if (saved_info_valid)
- {
- int32 passes_delta = strategy_passes - prev_strategy_passes;
-
- strategy_delta = strategy_buf_id - prev_strategy_buf_id;
- strategy_delta += (long) passes_delta * NBuffers;
+ * now process the clocksweep partitions, one by one, using the same
+ * cleanup that we used for all buffers
+ *
+ * XXX Maybe we should randomize the order of partitions a bit, so that
+ * we don't start from partition 0 all the time? Perhaps not entirely,
+ * but at least pick a random starting point?
+ */
+ for (int partition = 0; partition < num_partitions; partition++)
+ {
+ /* info obtained from freelist.c */
+ int strategy_buf_id;
+ uint32 strategy_passes;
+
+ /* Moving averages of allocation rate and clean-buffer density */
+ static float smoothed_alloc = 0;
+ static float smoothed_density = 10.0;
+
+ /* Potentially these could be tunables, but for now, not */
+ float smoothing_samples = 16;
+ float scan_whole_pool_milliseconds = 120000.0;
+
+ /* Used to compute how far we scan ahead */
+ long strategy_delta;
+ int bufs_to_lap;
+ int bufs_ahead;
+ float scans_per_alloc;
+ int reusable_buffers_est;
+ int upcoming_alloc_est;
+ int min_scan_buffers;
+
+ /* Variables for the scanning loop proper */
+ int num_to_scan;
+ int num_written;
+ int reusable_buffers;
+
+ /* Variables for final smoothed_density update */
+ long new_strategy_delta;
+ uint32 new_recent_alloc;
+
+ /* buffer range for the clocksweep partition */
+ int first_buffer;
+ int num_buffers;
- Assert(strategy_delta >= 0);
+ /*
+ * Find out where the freelist clock sweep currently is, and how many
+ * buffer allocations have happened since our last call.
+ */
+ strategy_buf_id = StrategySyncStart(partition, &strategy_passes,
+ &first_buffer, &num_buffers);
- if ((int32) (next_passes - strategy_passes) > 0)
+ /*
+ * Compute strategy_delta = how many buffers have been scanned by the
+ * clock sweep since last time. If first time through, assume none. Then
+ * see if we are still ahead of the clock sweep, and if so, how many
+ * buffers we could scan before we'd catch up with it and "lap" it. Note:
+ * weird-looking coding of xxx_passes comparisons are to avoid bogus
+ * behavior when the passes counts wrap around.
+ */
+ if (saved_info_valid)
{
- /* we're one pass ahead of the strategy point */
- bufs_to_lap = strategy_buf_id - next_to_clean;
+ int32 passes_delta = strategy_passes - prev_strategy_passes[partition];
+
+ strategy_delta = strategy_buf_id - prev_strategy_buf_id[partition];
+ strategy_delta += (long) passes_delta * num_buffers;
+
+ Assert(strategy_delta >= 0);
+
+ if ((int32) (next_passes[partition] - strategy_passes) > 0)
+ {
+ /* we're one pass ahead of the strategy point */
+ bufs_to_lap = strategy_buf_id - next_to_clean[partition];
#ifdef BGW_DEBUG
- elog(DEBUG2, "bgwriter ahead: bgw %u-%u strategy %u-%u delta=%ld lap=%d",
- next_passes, next_to_clean,
- strategy_passes, strategy_buf_id,
- strategy_delta, bufs_to_lap);
+ elog(DEBUG2, "bgwriter ahead: bgw %u-%u strategy %u-%u delta=%ld lap=%d",
+				 next_passes[partition], next_to_clean[partition],
+ strategy_passes, strategy_buf_id,
+ strategy_delta, bufs_to_lap);
#endif
- }
- else if (next_passes == strategy_passes &&
- next_to_clean >= strategy_buf_id)
- {
- /* on same pass, but ahead or at least not behind */
- bufs_to_lap = NBuffers - (next_to_clean - strategy_buf_id);
+ }
+ else if (next_passes[partition] == strategy_passes &&
+ next_to_clean[partition] >= strategy_buf_id)
+ {
+ /* on same pass, but ahead or at least not behind */
+ bufs_to_lap = num_buffers - (next_to_clean[partition] - strategy_buf_id);
+#ifdef BGW_DEBUG
+ elog(DEBUG2, "bgwriter ahead: bgw %u-%u strategy %u-%u delta=%ld lap=%d",
+				 next_passes[partition], next_to_clean[partition],
+ strategy_passes, strategy_buf_id,
+ strategy_delta, bufs_to_lap);
+#endif
+ }
+ else
+ {
+ /*
+ * We're behind, so skip forward to the strategy point and start
+ * cleaning from there.
+ */
#ifdef BGW_DEBUG
- elog(DEBUG2, "bgwriter ahead: bgw %u-%u strategy %u-%u delta=%ld lap=%d",
- next_passes, next_to_clean,
- strategy_passes, strategy_buf_id,
- strategy_delta, bufs_to_lap);
+ elog(DEBUG2, "bgwriter behind: bgw %u-%u strategy %u-%u delta=%ld",
+				 next_passes[partition], next_to_clean[partition],
+ strategy_passes, strategy_buf_id,
+ strategy_delta);
#endif
+ next_to_clean[partition] = strategy_buf_id;
+ next_passes[partition] = strategy_passes;
+ bufs_to_lap = num_buffers;
+ }
}
else
{
/*
- * We're behind, so skip forward to the strategy point and start
- * cleaning from there.
+ * Initializing at startup or after LRU scanning had been off. Always
+ * start at the strategy point.
*/
#ifdef BGW_DEBUG
- elog(DEBUG2, "bgwriter behind: bgw %u-%u strategy %u-%u delta=%ld",
- next_passes, next_to_clean,
- strategy_passes, strategy_buf_id,
- strategy_delta);
+ elog(DEBUG2, "bgwriter initializing: strategy %u-%u",
+ strategy_passes, strategy_buf_id);
#endif
- next_to_clean = strategy_buf_id;
- next_passes = strategy_passes;
- bufs_to_lap = NBuffers;
+ strategy_delta = 0;
+ next_to_clean[partition] = strategy_buf_id;
+ next_passes[partition] = strategy_passes;
+ bufs_to_lap = num_buffers;
}
- }
- else
- {
- /*
- * Initializing at startup or after LRU scanning had been off. Always
- * start at the strategy point.
- */
-#ifdef BGW_DEBUG
- elog(DEBUG2, "bgwriter initializing: strategy %u-%u",
- strategy_passes, strategy_buf_id);
-#endif
- strategy_delta = 0;
- next_to_clean = strategy_buf_id;
- next_passes = strategy_passes;
- bufs_to_lap = NBuffers;
- }
- /* Update saved info for next time */
- prev_strategy_buf_id = strategy_buf_id;
- prev_strategy_passes = strategy_passes;
- saved_info_valid = true;
+ /* Update saved info for next time */
+ prev_strategy_buf_id[partition] = strategy_buf_id;
+ prev_strategy_passes[partition] = strategy_passes;
+ // FIXME has to happen after all partitions
+ // saved_info_valid = true;
- /*
- * Compute how many buffers had to be scanned for each new allocation, ie,
- * 1/density of reusable buffers, and track a moving average of that.
- *
- * If the strategy point didn't move, we don't update the density estimate
- */
- if (strategy_delta > 0 && recent_alloc > 0)
- {
- scans_per_alloc = (float) strategy_delta / (float) recent_alloc;
- smoothed_density += (scans_per_alloc - smoothed_density) /
- smoothing_samples;
- }
+ /*
+ * Compute how many buffers had to be scanned for each new allocation, ie,
+ * 1/density of reusable buffers, and track a moving average of that.
+ *
+ * If the strategy point didn't move, we don't update the density estimate
+ */
+ if (strategy_delta > 0 && recent_alloc_partition > 0)
+ {
+ scans_per_alloc = (float) strategy_delta / (float) recent_alloc_partition;
+ smoothed_density += (scans_per_alloc - smoothed_density) /
+ smoothing_samples;
+ }
- /*
- * Estimate how many reusable buffers there are between the current
- * strategy point and where we've scanned ahead to, based on the smoothed
- * density estimate.
- */
- bufs_ahead = NBuffers - bufs_to_lap;
- reusable_buffers_est = (float) bufs_ahead / smoothed_density;
+ /*
+ * Estimate how many reusable buffers there are between the current
+ * strategy point and where we've scanned ahead to, based on the smoothed
+ * density estimate.
+ */
+ bufs_ahead = num_buffers - bufs_to_lap;
+ reusable_buffers_est = (float) bufs_ahead / smoothed_density;
- /*
- * Track a moving average of recent buffer allocations. Here, rather than
- * a true average we want a fast-attack, slow-decline behavior: we
- * immediately follow any increase.
- */
- if (smoothed_alloc <= (float) recent_alloc)
- smoothed_alloc = recent_alloc;
- else
- smoothed_alloc += ((float) recent_alloc - smoothed_alloc) /
- smoothing_samples;
+ /*
+ * Track a moving average of recent buffer allocations. Here, rather than
+ * a true average we want a fast-attack, slow-decline behavior: we
+ * immediately follow any increase.
+ */
+ if (smoothed_alloc <= (float) recent_alloc_partition)
+ smoothed_alloc = recent_alloc_partition;
+ else
+ smoothed_alloc += ((float) recent_alloc_partition - smoothed_alloc) /
+ smoothing_samples;
- /* Scale the estimate by a GUC to allow more aggressive tuning. */
- upcoming_alloc_est = (int) (smoothed_alloc * bgwriter_lru_multiplier);
+ /* Scale the estimate by a GUC to allow more aggressive tuning. */
+ upcoming_alloc_est = (int) (smoothed_alloc * bgwriter_lru_multiplier);
- /*
- * If recent_alloc remains at zero for many cycles, smoothed_alloc will
- * eventually underflow to zero, and the underflows produce annoying
- * kernel warnings on some platforms. Once upcoming_alloc_est has gone to
- * zero, there's no point in tracking smaller and smaller values of
- * smoothed_alloc, so just reset it to exactly zero to avoid this
- * syndrome. It will pop back up as soon as recent_alloc increases.
- */
- if (upcoming_alloc_est == 0)
- smoothed_alloc = 0;
+ /*
+ * If recent_alloc remains at zero for many cycles, smoothed_alloc will
+ * eventually underflow to zero, and the underflows produce annoying
+ * kernel warnings on some platforms. Once upcoming_alloc_est has gone to
+ * zero, there's no point in tracking smaller and smaller values of
+ * smoothed_alloc, so just reset it to exactly zero to avoid this
+ * syndrome. It will pop back up as soon as recent_alloc increases.
+ */
+ if (upcoming_alloc_est == 0)
+ smoothed_alloc = 0;
- /*
- * Even in cases where there's been little or no buffer allocation
- * activity, we want to make a small amount of progress through the buffer
- * cache so that as many reusable buffers as possible are clean after an
- * idle period.
- *
- * (scan_whole_pool_milliseconds / BgWriterDelay) computes how many times
- * the BGW will be called during the scan_whole_pool time; slice the
- * buffer pool into that many sections.
- */
- min_scan_buffers = (int) (NBuffers / (scan_whole_pool_milliseconds / BgWriterDelay));
+ /*
+ * Even in cases where there's been little or no buffer allocation
+ * activity, we want to make a small amount of progress through the buffer
+ * cache so that as many reusable buffers as possible are clean after an
+ * idle period.
+ *
+ * (scan_whole_pool_milliseconds / BgWriterDelay) computes how many times
+ * the BGW will be called during the scan_whole_pool time; slice the
+ * buffer pool into that many sections.
+ */
+ min_scan_buffers = (int) (num_buffers / (scan_whole_pool_milliseconds / BgWriterDelay));
- if (upcoming_alloc_est < (min_scan_buffers + reusable_buffers_est))
- {
+ if (upcoming_alloc_est < (min_scan_buffers + reusable_buffers_est))
+ {
#ifdef BGW_DEBUG
- elog(DEBUG2, "bgwriter: alloc_est=%d too small, using min=%d + reusable_est=%d",
- upcoming_alloc_est, min_scan_buffers, reusable_buffers_est);
+ elog(DEBUG2, "bgwriter: alloc_est=%d too small, using min=%d + reusable_est=%d",
+ upcoming_alloc_est, min_scan_buffers, reusable_buffers_est);
#endif
- upcoming_alloc_est = min_scan_buffers + reusable_buffers_est;
- }
-
- /*
- * Now write out dirty reusable buffers, working forward from the
- * next_to_clean point, until we have lapped the strategy scan, or cleaned
- * enough buffers to match our estimate of the next cycle's allocation
- * requirements, or hit the bgwriter_lru_maxpages limit.
- */
+ upcoming_alloc_est = min_scan_buffers + reusable_buffers_est;
+ }
- num_to_scan = bufs_to_lap;
- num_written = 0;
- reusable_buffers = reusable_buffers_est;
+ /*
+ * Now write out dirty reusable buffers, working forward from the
+ * next_to_clean point, until we have lapped the strategy scan, or cleaned
+ * enough buffers to match our estimate of the next cycle's allocation
+ * requirements, or hit the bgwriter_lru_maxpages limit.
+ */
- /* Execute the LRU scan */
- while (num_to_scan > 0 && reusable_buffers < upcoming_alloc_est)
- {
- int sync_state = SyncOneBuffer(next_to_clean, true,
- wb_context);
+ num_to_scan = bufs_to_lap;
+ num_written = 0;
+ reusable_buffers = reusable_buffers_est;
- if (++next_to_clean >= NBuffers)
+ /* Execute the LRU scan */
+ while (num_to_scan > 0 && reusable_buffers < upcoming_alloc_est)
{
- next_to_clean = 0;
- next_passes++;
- }
- num_to_scan--;
+ int sync_state = SyncOneBuffer(next_to_clean[partition], true,
+ wb_context);
- if (sync_state & BUF_WRITTEN)
- {
- reusable_buffers++;
- if (++num_written >= bgwriter_lru_maxpages)
+ if (++next_to_clean[partition] >= (first_buffer + num_buffers))
{
- PendingBgWriterStats.maxwritten_clean++;
- break;
+ next_to_clean[partition] = first_buffer;
+ next_passes[partition]++;
+ }
+ num_to_scan--;
+
+ if (sync_state & BUF_WRITTEN)
+ {
+ reusable_buffers++;
+ if (++num_written >= (bgwriter_lru_maxpages / num_partitions))
+ {
+ PendingBgWriterStats.maxwritten_clean++;
+ break;
+ }
}
+ else if (sync_state & BUF_REUSABLE)
+ reusable_buffers++;
}
- else if (sync_state & BUF_REUSABLE)
- reusable_buffers++;
- }
- PendingBgWriterStats.buf_written_clean += num_written;
+ PendingBgWriterStats.buf_written_clean += num_written;
#ifdef BGW_DEBUG
- elog(DEBUG1, "bgwriter: recent_alloc=%u smoothed=%.2f delta=%ld ahead=%d density=%.2f reusable_est=%d upcoming_est=%d scanned=%d wrote=%d reusable=%d",
- recent_alloc, smoothed_alloc, strategy_delta, bufs_ahead,
- smoothed_density, reusable_buffers_est, upcoming_alloc_est,
- bufs_to_lap - num_to_scan,
- num_written,
- reusable_buffers - reusable_buffers_est);
+ elog(DEBUG1, "bgwriter: recent_alloc=%u smoothed=%.2f delta=%ld ahead=%d density=%.2f reusable_est=%d upcoming_est=%d scanned=%d wrote=%d reusable=%d",
+ recent_alloc_partition, smoothed_alloc, strategy_delta, bufs_ahead,
+ smoothed_density, reusable_buffers_est, upcoming_alloc_est,
+ bufs_to_lap - num_to_scan,
+ num_written,
+ reusable_buffers - reusable_buffers_est);
#endif
- /*
- * Consider the above scan as being like a new allocation scan.
- * Characterize its density and update the smoothed one based on it. This
- * effectively halves the moving average period in cases where both the
- * strategy and the background writer are doing some useful scanning,
- * which is helpful because a long memory isn't as desirable on the
- * density estimates.
- */
- new_strategy_delta = bufs_to_lap - num_to_scan;
- new_recent_alloc = reusable_buffers - reusable_buffers_est;
- if (new_strategy_delta > 0 && new_recent_alloc > 0)
- {
- scans_per_alloc = (float) new_strategy_delta / (float) new_recent_alloc;
- smoothed_density += (scans_per_alloc - smoothed_density) /
- smoothing_samples;
+ /*
+ * Consider the above scan as being like a new allocation scan.
+ * Characterize its density and update the smoothed one based on it. This
+ * effectively halves the moving average period in cases where both the
+ * strategy and the background writer are doing some useful scanning,
+ * which is helpful because a long memory isn't as desirable on the
+ * density estimates.
+ */
+ new_strategy_delta = bufs_to_lap - num_to_scan;
+ new_recent_alloc = reusable_buffers - reusable_buffers_est;
+ if (new_strategy_delta > 0 && new_recent_alloc > 0)
+ {
+ scans_per_alloc = (float) new_strategy_delta / (float) new_recent_alloc;
+ smoothed_density += (scans_per_alloc - smoothed_density) /
+ smoothing_samples;
#ifdef BGW_DEBUG
- elog(DEBUG2, "bgwriter: cleaner density alloc=%u scan=%ld density=%.2f new smoothed=%.2f",
- new_recent_alloc, new_strategy_delta,
- scans_per_alloc, smoothed_density);
+ elog(DEBUG2, "bgwriter: cleaner density alloc=%u scan=%ld density=%.2f new smoothed=%.2f",
+ new_recent_alloc, new_strategy_delta,
+ scans_per_alloc, smoothed_density);
#endif
+ }
+
+ /* hibernate if all partitions can hibernate */
+ hibernate &= (bufs_to_lap == 0 && recent_alloc_partition == 0);
}
+ /* now that we've scanned all partitions, mark the cached info as valid */
+ saved_info_valid = true;
+
/* Return true if OK to hibernate */
- return (bufs_to_lap == 0 && recent_alloc == 0);
+ return hibernate;
}
/*
diff --git a/src/backend/storage/buffer/freelist.c b/src/backend/storage/buffer/freelist.c
index e38e5c7ec3d..1827e052da7 100644
--- a/src/backend/storage/buffer/freelist.c
+++ b/src/backend/storage/buffer/freelist.c
@@ -63,17 +63,27 @@ typedef struct BufferStrategyFreelist
#define MIN_FREELIST_PARTITIONS 4
/*
- * The shared freelist control information.
+ * Information about one partition of the ClockSweep (on a subset of buffers).
+ *
+ * XXX Should be careful to align this to cachelines, etc.
*/
typedef struct
{
/* Spinlock: protects the values below */
- slock_t buffer_strategy_lock;
+ slock_t clock_sweep_lock;
+
+	/* range for this clock sweep partition */
+ int32 firstBuffer;
+ int32 numBuffers;
/*
* Clock sweep hand: index of next buffer to consider grabbing. Note that
* this isn't a concrete buffer - we only ever increase the value. So, to
* get an actual buffer, it needs to be used modulo NBuffers.
+ *
+ * XXX This is relative to firstBuffer, so needs to be offset properly.
+ *
+ * XXX firstBuffer + (nextVictimBuffer % numBuffers)
*/
pg_atomic_uint32 nextVictimBuffer;
@@ -83,6 +93,15 @@ typedef struct
*/
uint32 completePasses; /* Complete cycles of the clock sweep */
pg_atomic_uint32 numBufferAllocs; /* Buffers allocated since last reset */
+} ClockSweep;
+
+/*
+ * The shared freelist control information.
+ */
+typedef struct
+{
+ /* Spinlock: protects the values below */
+ slock_t buffer_strategy_lock;
/*
* Bgworker process to be notified upon activity or -1 if none. See
@@ -99,6 +118,9 @@ typedef struct
int num_partitions_groups; /* effectively num of NUMA nodes */
int num_partitions_per_group;
+ /* clocksweep partitions */
+ ClockSweep *sweeps;
+
BufferStrategyFreelist freelists[FLEXIBLE_ARRAY_MEMBER];
} BufferStrategyControl;
@@ -152,6 +174,7 @@ static BufferDesc *GetBufferFromRing(BufferAccessStrategy strategy,
uint32 *buf_state);
static void AddBufferToRing(BufferAccessStrategy strategy,
BufferDesc *buf);
+static ClockSweep *ChooseClockSweep(void);
/*
* ClockSweepTick - Helper routine for StrategyGetBuffer()
@@ -163,6 +186,7 @@ static inline uint32
ClockSweepTick(void)
{
uint32 victim;
+ ClockSweep *sweep = ChooseClockSweep();
/*
* Atomically move hand ahead one buffer - if there's several processes
@@ -170,14 +194,14 @@ ClockSweepTick(void)
* apparent order.
*/
victim =
- pg_atomic_fetch_add_u32(&StrategyControl->nextVictimBuffer, 1);
+ pg_atomic_fetch_add_u32(&sweep->nextVictimBuffer, 1);
- if (victim >= NBuffers)
+ if (victim >= sweep->numBuffers)
{
uint32 originalVictim = victim;
/* always wrap what we look up in BufferDescriptors */
- victim = victim % NBuffers;
+ victim = victim % sweep->numBuffers;
/*
* If we're the one that just caused a wraparound, force
@@ -203,19 +227,23 @@ ClockSweepTick(void)
* could lead to an overflow of nextVictimBuffers, but that's
* highly unlikely and wouldn't be particularly harmful.
*/
- SpinLockAcquire(&StrategyControl->buffer_strategy_lock);
+ SpinLockAcquire(&sweep->clock_sweep_lock);
- wrapped = expected % NBuffers;
+ wrapped = expected % sweep->numBuffers;
- success = pg_atomic_compare_exchange_u32(&StrategyControl->nextVictimBuffer,
+ success = pg_atomic_compare_exchange_u32(&sweep->nextVictimBuffer,
&expected, wrapped);
if (success)
- StrategyControl->completePasses++;
- SpinLockRelease(&StrategyControl->buffer_strategy_lock);
+ sweep->completePasses++;
+ SpinLockRelease(&sweep->clock_sweep_lock);
}
}
}
- return victim;
+
+ /* XXX buffer IDs are 1-based, we're calculating 0-based indexes */
+ Assert(BufferIsValid(1 + sweep->firstBuffer + (victim % sweep->numBuffers)));
+
+ return sweep->firstBuffer + victim;
}
/*
@@ -289,6 +317,28 @@ calculate_partition_index()
return index;
}
+/*
+ * ChooseClockSweep
+ * pick a clocksweep partition based on NUMA node and CPU
+ *
+ * The number of clocksweep partitions may not match the number of NUMA
+ * nodes, but it should not be lower. Each partition should be mapped to
+ * a single NUMA node, but a node may have multiple partitions. If there
+ * are multiple partitions per node (all nodes have the same number of
+ * partitions), we pick the partition based on the CPU.
+ *
+ * XXX Maybe we should make both the total and "per group" counts powers of
+ * two? That'd allow using shifts instead of divisions in the calculation,
+ * and that's cheaper. But how would that deal with an odd number of nodes?
+ */
+static ClockSweep *
+ChooseClockSweep(void)
+{
+ int index = calculate_partition_index();
+
+ return &StrategyControl->sweeps[index];
+}
+
/*
* ChooseFreeList
* Pick the buffer freelist to use, depending on the CPU and NUMA node.
@@ -404,7 +454,7 @@ StrategyGetBuffer(BufferAccessStrategy strategy, uint32 *buf_state, bool *from_r
* the rate of buffer consumption. Note that buffers recycled by a
* strategy object are intentionally not counted here.
*/
- pg_atomic_fetch_add_u32(&StrategyControl->numBufferAllocs, 1);
+ pg_atomic_fetch_add_u32(&ChooseClockSweep()->numBufferAllocs, 1);
/*
* First check, without acquiring the lock, whether there's buffers in the
@@ -475,13 +525,17 @@ StrategyGetBuffer(BufferAccessStrategy strategy, uint32 *buf_state, bool *from_r
/*
* Nothing on the freelist, so run the "clock sweep" algorithm
*
- * XXX Should we also make this NUMA-aware, to only access buffers from
- * the same NUMA node? That'd probably mean we need to make the clock
- * sweep NUMA-aware, perhaps by having multiple clock sweeps, each for a
- * subset of buffers. But that also means each process could "sweep" only
- * a fraction of buffers, even if the other buffers are better candidates
- * for eviction. Would that also mean we'd have multiple bgwriters, one
- * for each node, or would one bgwriter handle all of that?
+ * XXX Note that ClockSweepTick() is NUMA-aware, i.e. it only looks at
+ * buffers from a single partition, aligned with the NUMA node. That
+ * means it only accesses buffers from the same NUMA node.
+ *
+ * XXX That also means each process "sweeps" only a fraction of buffers,
+ * even if the other buffers are better candidates for eviction. Maybe
+ * there should be some logic to "steal" buffers from other freelists
+ * or other nodes?
+ *
+ * XXX Would that also mean we'd have multiple bgwriters, one for each
+ * node, or would one bgwriter handle all of that?
*/
trycounter = NBuffers;
for (;;)
@@ -563,6 +617,41 @@ StrategyFreeBuffer(BufferDesc *buf)
SpinLockRelease(&freelist->freelist_lock);
}
+/*
+ * StrategySyncPrepare -- prepare for sync of all partitions
+ *
+ * Determine the number of clocksweep partitions, and sum up the recent
+ * buffer alloc counts from all of them. This allows BgBufferSync to
+ * calculate the average number of allocations per partition for the next
+ * sync cycle.
+ *
+ * The returned alloc count is the total summed from all partitions; the
+ * per-partition counters are reset after being read, as the partitions
+ * are walked.
+ */
+void
+StrategySyncPrepare(int *num_parts, uint32 *num_buf_alloc)
+{
+ *num_buf_alloc = 0;
+ *num_parts = StrategyControl->num_partitions;
+
+ /*
+	 * We lock the partitions one by one, so not exactly in sync, but that
+ * should be fine. We're only looking for heuristics anyway.
+ */
+ for (int i = 0; i < StrategyControl->num_partitions; i++)
+ {
+ ClockSweep *sweep = &StrategyControl->sweeps[i];
+
+ SpinLockAcquire(&sweep->clock_sweep_lock);
+ if (num_buf_alloc)
+ {
+ *num_buf_alloc += pg_atomic_exchange_u32(&sweep->numBufferAllocs, 0);
+ }
+ SpinLockRelease(&sweep->clock_sweep_lock);
+ }
+}
+
/*
* StrategySyncStart -- tell BgBufferSync where to start syncing
*
@@ -570,37 +659,44 @@ StrategyFreeBuffer(BufferDesc *buf)
* BgBufferSync() will proceed circularly around the buffer array from there.
*
* In addition, we return the completed-pass count (which is effectively
- * the higher-order bits of nextVictimBuffer) and the count of recent buffer
- * allocs if non-NULL pointers are passed. The alloc count is reset after
- * being read.
+ * the higher-order bits of nextVictimBuffer).
+ *
+ * This only considers a single clocksweep partition, as BgBufferSync looks
+ * at them one by one.
*/
int
-StrategySyncStart(uint32 *complete_passes, uint32 *num_buf_alloc)
+StrategySyncStart(int partition, uint32 *complete_passes,
+ int *first_buffer, int *num_buffers)
{
uint32 nextVictimBuffer;
int result;
+ ClockSweep *sweep = &StrategyControl->sweeps[partition];
- SpinLockAcquire(&StrategyControl->buffer_strategy_lock);
- nextVictimBuffer = pg_atomic_read_u32(&StrategyControl->nextVictimBuffer);
- result = nextVictimBuffer % NBuffers;
+ Assert((partition >= 0) && (partition < StrategyControl->num_partitions));
+
+ SpinLockAcquire(&sweep->clock_sweep_lock);
+ nextVictimBuffer = pg_atomic_read_u32(&sweep->nextVictimBuffer);
+ result = nextVictimBuffer % sweep->numBuffers;
+
+ *first_buffer = sweep->firstBuffer;
+ *num_buffers = sweep->numBuffers;
if (complete_passes)
{
- *complete_passes = StrategyControl->completePasses;
+ *complete_passes = sweep->completePasses;
/*
* Additionally add the number of wraparounds that happened before
* completePasses could be incremented. C.f. ClockSweepTick().
*/
- *complete_passes += nextVictimBuffer / NBuffers;
+ *complete_passes += nextVictimBuffer / sweep->numBuffers;
}
+ SpinLockRelease(&sweep->clock_sweep_lock);
- if (num_buf_alloc)
- {
- *num_buf_alloc = pg_atomic_exchange_u32(&StrategyControl->numBufferAllocs, 0);
- }
- SpinLockRelease(&StrategyControl->buffer_strategy_lock);
- return result;
+ /* XXX buffer IDs start at 1, we're calculating 0-based indexes */
+ Assert(BufferIsValid(1 + sweep->firstBuffer + result));
+
+ return sweep->firstBuffer + result;
}
/*
@@ -696,6 +792,10 @@ StrategyShmemSize(void)
size = add_size(size, MAXALIGN(mul_size(sizeof(BufferStrategyFreelist),
num_partitions)));
+ /* size of clocksweep partitions (at least one per NUMA node) */
+ size = add_size(size, MAXALIGN(mul_size(sizeof(ClockSweep),
+ num_partitions)));
+
return size;
}
@@ -714,6 +814,7 @@ StrategyInitialize(bool init)
int num_partitions;
int num_partitions_per_group;
+ char *ptr;
/* */
num_partitions = calculate_partition_count(strategy_nnodes);
@@ -736,7 +837,8 @@ StrategyInitialize(bool init)
StrategyControl = (BufferStrategyControl *)
ShmemInitStruct("Buffer Strategy Status",
MAXALIGN(offsetof(BufferStrategyControl, freelists)) +
- MAXALIGN(sizeof(BufferStrategyFreelist) * num_partitions),
+ MAXALIGN(sizeof(BufferStrategyFreelist) * num_partitions) +
+ MAXALIGN(sizeof(ClockSweep) * num_partitions),
&found);
if (!found)
@@ -758,12 +860,32 @@ StrategyInitialize(bool init)
SpinLockInit(&StrategyControl->buffer_strategy_lock);
- /* Initialize the clock sweep pointer */
- pg_atomic_init_u32(&StrategyControl->nextVictimBuffer, 0);
+ /* have to point the sweeps array to right after the freelists */
+ ptr = (char *) StrategyControl +
+ MAXALIGN(offsetof(BufferStrategyControl, freelists)) +
+ MAXALIGN(sizeof(BufferStrategyFreelist) * num_partitions);
+ StrategyControl->sweeps = (ClockSweep *) ptr;
- /* Clear statistics */
- StrategyControl->completePasses = 0;
- pg_atomic_init_u32(&StrategyControl->numBufferAllocs, 0);
+ /* Initialize the clock sweep pointers (for all partitions) */
+ for (int i = 0; i < num_partitions; i++)
+ {
+ SpinLockInit(&StrategyControl->sweeps[i].clock_sweep_lock);
+
+ pg_atomic_init_u32(&StrategyControl->sweeps[i].nextVictimBuffer, 0);
+
+ /*
+			 * FIXME This may not be quite right, because if NBuffers is not
+			 * a perfect multiple of num_partitions, the last partition will have
+ * numBuffers set too high. buf_init handles this by tracking the
+ * remaining number of buffers, and not overflowing.
+ */
+ StrategyControl->sweeps[i].numBuffers = numBuffers;
+ StrategyControl->sweeps[i].firstBuffer = (numBuffers * i);
+
+ /* Clear statistics */
+ StrategyControl->sweeps[i].completePasses = 0;
+ pg_atomic_init_u32(&StrategyControl->sweeps[i].numBufferAllocs, 0);
+ }
/* No pending notification */
StrategyControl->bgwprocno = -1;
diff --git a/src/include/storage/buf_internals.h b/src/include/storage/buf_internals.h
index 52a71b138f7..b50f9458156 100644
--- a/src/include/storage/buf_internals.h
+++ b/src/include/storage/buf_internals.h
@@ -448,7 +448,9 @@ extern void StrategyFreeBuffer(BufferDesc *buf);
extern bool StrategyRejectBuffer(BufferAccessStrategy strategy,
BufferDesc *buf, bool from_ring);
-extern int StrategySyncStart(uint32 *complete_passes, uint32 *num_buf_alloc);
+extern void StrategySyncPrepare(int *num_parts, uint32 *num_buf_alloc);
+extern int StrategySyncStart(int partition, uint32 *complete_passes,
+ int *first_buffer, int *num_buffers);
extern void StrategyNotifyBgWriter(int bgwprocno);
extern Size StrategyShmemSize(void);
--
2.49.0
Attachment: v2-0004-NUMA-partition-buffer-freelist.patch
From d67278a64983b5f2eb5e408a51e9516aa8fd2264 Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tomas@vondra.me>
Date: Thu, 22 May 2025 18:38:41 +0200
Subject: [PATCH v2 4/7] NUMA: partition buffer freelist
Instead of a single buffer freelist, partition into multiple smaller
lists, to reduce lock contention, and to spread the buffers over all
NUMA nodes more evenly.
There are four strategies, specified by GUC numa_partition_freelist
* none - single long freelist, should work just like now
* node - one freelist per NUMA node, with only buffers from that node
* cpu - one freelist per CPU
* pid - freelist determined by PID (same number of freelists as 'cpu')
When allocating a buffer, it's taken from the correct freelist (e.g.
same NUMA node).
Note: This is (probably) more important than partitioning ProcArray.
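To make the mode selection concrete, here is a minimal standalone sketch (not
code from the patch) of how a backend might map itself to one of the freelists
under each mode. The helper choose_freelist_index() and the num_freelists
argument are made up for illustration; the enum values mirror the
FreelistPartitionMode added to bufmgr.h, and getcpu() is the same call the
patch uses in calculate_partition_index().

#define _GNU_SOURCE
#include <sched.h>
#include <unistd.h>

typedef enum FreelistPartitionMode
{
	FREELIST_PARTITION_NONE,
	FREELIST_PARTITION_NODE,
	FREELIST_PARTITION_CPU,
	FREELIST_PARTITION_PID,
} FreelistPartitionMode;

static int
choose_freelist_index(FreelistPartitionMode mode, int num_freelists)
{
	unsigned int cpu,
				node;

	if (mode == FREELIST_PARTITION_NONE || num_freelists <= 1)
		return 0;				/* single freelist, same as today */

	if (mode == FREELIST_PARTITION_PID)
		return getpid() % num_freelists;	/* no syscall per allocation */

	if (getcpu(&cpu, &node) != 0)
		return 0;				/* be defensive, fall back to list 0 */

	if (mode == FREELIST_PARTITION_NODE)
		return node % num_freelists;	/* one list per NUMA node */

	return cpu % num_freelists;	/* FREELIST_PARTITION_CPU */
}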
---
src/backend/storage/buffer/buf_init.c | 4 +-
src/backend/storage/buffer/freelist.c | 372 ++++++++++++++++++++++++--
src/backend/utils/init/globals.c | 1 +
src/backend/utils/misc/guc_tables.c | 10 +
src/include/miscadmin.h | 1 +
src/include/storage/bufmgr.h | 8 +
6 files changed, 367 insertions(+), 29 deletions(-)
diff --git a/src/backend/storage/buffer/buf_init.c b/src/backend/storage/buffer/buf_init.c
index 2ad34624c49..920f1a32a8f 100644
--- a/src/backend/storage/buffer/buf_init.c
+++ b/src/backend/storage/buffer/buf_init.c
@@ -543,8 +543,8 @@ pg_numa_interleave_memory(char *startptr, char *endptr,
* XXX no return value, to make this fail on error, has to use
* numa_set_strict
*
- * XXX Should we still touch the memory first, like with numa_move_pages,
- * or is that not necessary?
+ * XXX Should we still touch the memory first, like with
+ * numa_move_pages, or is that not necessary?
*/
numa_tonode_memory(ptr, sz, node);
diff --git a/src/backend/storage/buffer/freelist.c b/src/backend/storage/buffer/freelist.c
index e046526c149..e38e5c7ec3d 100644
--- a/src/backend/storage/buffer/freelist.c
+++ b/src/backend/storage/buffer/freelist.c
@@ -15,14 +15,52 @@
*/
#include "postgres.h"
+#include <sched.h>
+#include <sys/sysinfo.h>
+
+#ifdef USE_LIBNUMA
+#include <numa.h>
+#include <numaif.h>
+#endif
+
#include "pgstat.h"
#include "port/atomics.h"
#include "storage/buf_internals.h"
#include "storage/bufmgr.h"
+#include "storage/ipc.h"
#include "storage/proc.h"
#define INT_ACCESS_ONCE(var) ((int)(*((volatile int *)&(var))))
+/*
+ * Represents one freelist partition.
+ */
+typedef struct BufferStrategyFreelist
+{
+ /* Spinlock: protects the values below */
+ slock_t freelist_lock;
+
+ /*
+ * XXX Not sure why this needs to be aligned like this. Need to ask
+ * Andres.
+ */
+ int firstFreeBuffer __attribute__((aligned(64))); /* Head of list of
+ * unused buffers */
+
+ /* Number of buffers consumed from this list. */
+ uint64 consumed;
+} BufferStrategyFreelist;
+
+/*
+ * The minimum number of partitions we want to have. We want at least this
+ * number of partitions, even on a non-NUMA system, as it helps with contention
+ * for buffers. But with multiple NUMA nodes, we want a separate partition per
+ * node, and we may get multiple partitions per node when the node count is low.
+ *
+ * With multiple partitions per NUMA node, we pick the partition based on CPU
+ * (or some other parameter).
+ */
+#define MIN_FREELIST_PARTITIONS 4
/*
* The shared freelist control information.
@@ -39,8 +77,6 @@ typedef struct
*/
pg_atomic_uint32 nextVictimBuffer;
- int firstFreeBuffer; /* Head of list of unused buffers */
-
/*
* Statistics. These counters should be wide enough that they can't
* overflow during a single bgwriter cycle.
@@ -51,13 +87,38 @@ typedef struct
/*
* Bgworker process to be notified upon activity or -1 if none. See
* StrategyNotifyBgWriter.
+ *
+ * XXX Not sure why this needs to be aligned like this. Need to ask
+ * Andres. Also, shouldn't the alignment be specified after, like for
+ * "consumed"?
*/
- int bgwprocno;
+ int __attribute__((aligned(64))) bgwprocno;
+
+ /* info about freelist partitioning */
+ int num_partitions;
+ int num_partitions_groups; /* effectively num of NUMA nodes */
+ int num_partitions_per_group;
+
+ BufferStrategyFreelist freelists[FLEXIBLE_ARRAY_MEMBER];
} BufferStrategyControl;
/* Pointers to shared state */
static BufferStrategyControl *StrategyControl = NULL;
+/*
+ * XXX shouldn't this be in BufferStrategyControl? Probably not, we need to
+ * calculate it during sizing, and perhaps it could change before the memory
+ * gets allocated (so we need to remember the values).
+ *
+ * XXX We should probably have a fixed number of partitions, and map the
+ * NUMA nodes to them, somehow (i.e. each node would get some subset of
+ * partitions). Similar to NUM_LOCK_PARTITIONS.
+ *
+ * XXX We don't use the ncpus, really.
+ */
+static int strategy_ncpus;
+static int strategy_nnodes;
+
/*
* Private (non-shared) state for managing a ring of shared buffers to re-use.
* This is currently the only kind of BufferAccessStrategy object, but someday
@@ -157,6 +218,104 @@ ClockSweepTick(void)
return victim;
}
+/*
+ * Size the clocksweep partitions. At least one partition per NUMA node,
+ * but at least MIN_FREELIST_PARTITIONS partitions in total.
+*/
+static int
+calculate_partition_count(int num_nodes)
+{
+ int num_per_node = 1;
+
+ while (num_per_node * num_nodes < MIN_FREELIST_PARTITIONS)
+ num_per_node++;
+
+ return (num_nodes * num_per_node);
+}
+
+static int
+calculate_partition_index()
+{
+ int rc;
+ unsigned cpu;
+ unsigned node;
+ int index;
+
+ Assert(StrategyControl->num_partitions_groups == strategy_nnodes);
+
+ Assert(StrategyControl->num_partitions ==
+ (strategy_nnodes * StrategyControl->num_partitions_per_group));
+
+ /*
+ * freelist is partitioned, so determine the CPU/NUMA node, and pick a
+ * list based on that.
+ */
+ rc = getcpu(&cpu, &node);
+ if (rc != 0)
+ elog(ERROR, "getcpu failed: %m");
+
+ /*
+	 * XXX We shouldn't get nodes that we haven't considered while building
+ * the partitions. Maybe if we allow this (e.g. due to support adjusting
+ * the NUMA stuff at runtime), we should just do our best to minimize
+ * the conflicts somehow. But it'll make the mapping harder, so for now
+ * we ignore it.
+ */
+ if (node > strategy_nnodes)
+		elog(ERROR, "node out of range: %u > %d", node, strategy_nnodes);
+
+ /*
+ * Find the partition. If we have a single partition per node, we can
+ * calculate the index directly from node. Otherwise we need to do two
+ * steps, using node and then cpu.
+ */
+ if (StrategyControl->num_partitions_per_group == 1)
+ {
+ index = (node % StrategyControl->num_partitions);
+ }
+ else
+ {
+ int index_group,
+ index_part;
+
+ /* two steps - calculate group from node, partition from cpu */
+ index_group = (node % StrategyControl->num_partitions_groups);
+ index_part = (cpu % StrategyControl->num_partitions_per_group);
+
+ index = (index_group * StrategyControl->num_partitions_per_group)
+ + index_part;
+ }
+
+ return index;
+}
+
+/*
+ * ChooseFreeList
+ * Pick the buffer freelist to use, depending on the CPU and NUMA node.
+ *
+ * Without partitioned freelists (numa_partition_freelist=false), there's only
+ * a single freelist, so use that.
+ *
+ * With partitioned freelists, we have multiple ways how to pick the freelist
+ * for the backend:
+ *
+ * - one freelist per CPU, use the freelist for CPU the task executes on
+ *
+ * - one freelist per NUMA node, use the freelist for node task executes on
+ *
+ * - use fixed number of freelists, map processes to lists based on PID
+ *
+ * There may be some other strategies, not sure. The important thing is this
+ * needs to be reflected during initialization, i.e. we need to create the
+ * right number of lists.
+ */
+static BufferStrategyFreelist *
+ChooseFreeList(void)
+{
+ int index = calculate_partition_index();
+ return &StrategyControl->freelists[index];
+}
+
/*
* have_free_buffer -- a lockless check to see if there is a free buffer in
* buffer pool.
@@ -168,10 +327,13 @@ ClockSweepTick(void)
bool
have_free_buffer(void)
{
- if (StrategyControl->firstFreeBuffer >= 0)
- return true;
- else
- return false;
+	for (int i = 0; i < StrategyControl->num_partitions; i++)
+ {
+ if (StrategyControl->freelists[i].firstFreeBuffer >= 0)
+ return true;
+ }
+
+ return false;
}
/*
@@ -193,6 +355,7 @@ StrategyGetBuffer(BufferAccessStrategy strategy, uint32 *buf_state, bool *from_r
int bgwprocno;
int trycounter;
uint32 local_buf_state; /* to avoid repeated (de-)referencing */
+ BufferStrategyFreelist *freelist;
*from_ring = false;
@@ -259,31 +422,35 @@ StrategyGetBuffer(BufferAccessStrategy strategy, uint32 *buf_state, bool *from_r
* buffer_strategy_lock not the individual buffer spinlocks, so it's OK to
* manipulate them without holding the spinlock.
*/
- if (StrategyControl->firstFreeBuffer >= 0)
+ freelist = ChooseFreeList();
+ if (freelist->firstFreeBuffer >= 0)
{
while (true)
{
/* Acquire the spinlock to remove element from the freelist */
- SpinLockAcquire(&StrategyControl->buffer_strategy_lock);
+ SpinLockAcquire(&freelist->freelist_lock);
- if (StrategyControl->firstFreeBuffer < 0)
+ if (freelist->firstFreeBuffer < 0)
{
- SpinLockRelease(&StrategyControl->buffer_strategy_lock);
+ SpinLockRelease(&freelist->freelist_lock);
break;
}
- buf = GetBufferDescriptor(StrategyControl->firstFreeBuffer);
+ buf = GetBufferDescriptor(freelist->firstFreeBuffer);
Assert(buf->freeNext != FREENEXT_NOT_IN_LIST);
/* Unconditionally remove buffer from freelist */
- StrategyControl->firstFreeBuffer = buf->freeNext;
+ freelist->firstFreeBuffer = buf->freeNext;
buf->freeNext = FREENEXT_NOT_IN_LIST;
+ /* increment number of buffers we consumed from this list */
+ freelist->consumed++;
+
/*
* Release the lock so someone else can access the freelist while
* we check out this buffer.
*/
- SpinLockRelease(&StrategyControl->buffer_strategy_lock);
+ SpinLockRelease(&freelist->freelist_lock);
/*
* If the buffer is pinned or has a nonzero usage_count, we cannot
@@ -305,7 +472,17 @@ StrategyGetBuffer(BufferAccessStrategy strategy, uint32 *buf_state, bool *from_r
}
}
- /* Nothing on the freelist, so run the "clock sweep" algorithm */
+ /*
+ * Nothing on the freelist, so run the "clock sweep" algorithm
+ *
+ * XXX Should we also make this NUMA-aware, to only access buffers from
+ * the same NUMA node? That'd probably mean we need to make the clock
+ * sweep NUMA-aware, perhaps by having multiple clock sweeps, each for a
+ * subset of buffers. But that also means each process could "sweep" only
+ * a fraction of buffers, even if the other buffers are better candidates
+ * for eviction. Would that also mean we'd have multiple bgwriters, one
+ * for each node, or would one bgwriter handle all of that?
+ */
trycounter = NBuffers;
for (;;)
{
@@ -356,7 +533,22 @@ StrategyGetBuffer(BufferAccessStrategy strategy, uint32 *buf_state, bool *from_r
void
StrategyFreeBuffer(BufferDesc *buf)
{
- SpinLockAcquire(&StrategyControl->buffer_strategy_lock);
+ BufferStrategyFreelist *freelist;
+
+ /*
+ * We don't want to call ChooseFreeList() again, because we might get a
+ * completely different freelist - either a different partition in the
+ * same group, or even a different group if the NUMA node changed. But
+ * we can calculate the proper freelist from the buffer id.
+ */
+ int index = (BufferGetNode(buf->buf_id) * StrategyControl->num_partitions_per_group)
+ + (buf->buf_id % StrategyControl->num_partitions_per_group);
+
+ Assert((index >= 0) && (index < StrategyControl->num_partitions));
+
+ freelist = &StrategyControl->freelists[index];
+
+ SpinLockAcquire(&freelist->freelist_lock);
/*
* It is possible that we are told to put something in the freelist that
@@ -364,11 +556,11 @@ StrategyFreeBuffer(BufferDesc *buf)
*/
if (buf->freeNext == FREENEXT_NOT_IN_LIST)
{
- buf->freeNext = StrategyControl->firstFreeBuffer;
- StrategyControl->firstFreeBuffer = buf->buf_id;
+ buf->freeNext = freelist->firstFreeBuffer;
+ freelist->firstFreeBuffer = buf->buf_id;
}
- SpinLockRelease(&StrategyControl->buffer_strategy_lock);
+ SpinLockRelease(&freelist->freelist_lock);
}
/*
@@ -432,6 +624,42 @@ StrategyNotifyBgWriter(int bgwprocno)
SpinLockRelease(&StrategyControl->buffer_strategy_lock);
}
+/* prints some debug info / stats about freelists at shutdown */
+static void
+freelist_before_shmem_exit(int code, Datum arg)
+{
+	for (int node = 0; node < StrategyControl->num_partitions; node++)
+ {
+ BufferStrategyFreelist *freelist = &StrategyControl->freelists[node];
+ uint64 remain = 0;
+ uint64 actually_free = 0;
+ int cur = freelist->firstFreeBuffer;
+
+ while (cur >= 0)
+ {
+ uint32 local_buf_state;
+ BufferDesc *buf;
+
+ buf = GetBufferDescriptor(cur);
+
+ remain++;
+
+ local_buf_state = LockBufHdr(buf);
+
+ if (!(local_buf_state & BM_TAG_VALID))
+ actually_free++;
+
+ UnlockBufHdr(buf, local_buf_state);
+
+ cur = buf->freeNext;
+ }
+ elog(LOG, "freelist %d, firstF: %d: consumed: %lu, remain: %lu, actually free: %lu",
+ node,
+ freelist->firstFreeBuffer,
+ freelist->consumed,
+ remain, actually_free);
+ }
+}
/*
* StrategyShmemSize
@@ -445,12 +673,28 @@ Size
StrategyShmemSize(void)
{
Size size = 0;
+ int num_partitions;
+
+ /* FIXME */
+#ifdef USE_LIBNUMA
+ strategy_ncpus = numa_num_task_cpus();
+ strategy_nnodes = numa_num_task_nodes();
+#else
+ strategy_ncpus = 1;
+ strategy_nnodes = 1;
+#endif
+
+ num_partitions = calculate_partition_count(strategy_nnodes);
/* size of lookup hash table ... see comment in StrategyInitialize */
size = add_size(size, BufTableShmemSize(NBuffers + NUM_BUFFER_PARTITIONS));
/* size of the shared replacement strategy control block */
- size = add_size(size, MAXALIGN(sizeof(BufferStrategyControl)));
+ size = add_size(size, MAXALIGN(offsetof(BufferStrategyControl, freelists)));
+
+ /* size of freelist partitions (at least one per NUMA node) */
+ size = add_size(size, MAXALIGN(mul_size(sizeof(BufferStrategyFreelist),
+ num_partitions)));
return size;
}
@@ -466,6 +710,13 @@ void
StrategyInitialize(bool init)
{
bool found;
+ int buffers_per_partition;
+
+ int num_partitions;
+ int num_partitions_per_group;
+
+ /* */
+ num_partitions = calculate_partition_count(strategy_nnodes);
/*
* Initialize the shared buffer lookup hashtable.
@@ -484,23 +735,28 @@ StrategyInitialize(bool init)
*/
StrategyControl = (BufferStrategyControl *)
ShmemInitStruct("Buffer Strategy Status",
- sizeof(BufferStrategyControl),
+ MAXALIGN(offsetof(BufferStrategyControl, freelists)) +
+ MAXALIGN(sizeof(BufferStrategyFreelist) * num_partitions),
&found);
if (!found)
{
+ int32 numBuffers = NBuffers / num_partitions;
+
+ while (numBuffers * num_partitions < NBuffers)
+ numBuffers++;
+
+		Assert(numBuffers * num_partitions >= NBuffers);
+
/*
* Only done once, usually in postmaster
*/
Assert(init);
- SpinLockInit(&StrategyControl->buffer_strategy_lock);
+ /* register callback to dump some stats on exit */
+ before_shmem_exit(freelist_before_shmem_exit, 0);
- /*
- * Grab the whole linked list of free buffers for our strategy. We
- * assume it was previously set up by BufferManagerShmemInit().
- */
- StrategyControl->firstFreeBuffer = 0;
+ SpinLockInit(&StrategyControl->buffer_strategy_lock);
/* Initialize the clock sweep pointer */
pg_atomic_init_u32(&StrategyControl->nextVictimBuffer, 0);
@@ -511,6 +767,68 @@ StrategyInitialize(bool init)
/* No pending notification */
StrategyControl->bgwprocno = -1;
+
+ /* always a multiple of NUMA nodes */
+ Assert(num_partitions % strategy_nnodes == 0);
+
+ num_partitions_per_group = (num_partitions / strategy_nnodes);
+
+ /* initialize the partitioned clocksweep */
+ StrategyControl->num_partitions = num_partitions;
+ StrategyControl->num_partitions_groups = strategy_nnodes;
+ StrategyControl->num_partitions_per_group = num_partitions_per_group;
+
+ /*
+ * Rebuild the freelist - right now all buffers are in one huge list,
+ * we want to rework that into multiple lists. Start by initializing
+ * the strategy to have empty lists.
+ */
+ for (int nfreelist = 0; nfreelist < num_partitions; nfreelist++)
+ {
+ BufferStrategyFreelist *freelist;
+
+ freelist = &StrategyControl->freelists[nfreelist];
+
+ freelist->firstFreeBuffer = FREENEXT_END_OF_LIST;
+
+ SpinLockInit(&freelist->freelist_lock);
+ }
+
+ /* buffers per partition */
+ buffers_per_partition = (NBuffers / num_partitions);
+
+ elog(LOG, "NBuffers: %d, nodes %d, ncpus: %d, divide: %d, remain: %d",
+ NBuffers, strategy_nnodes, strategy_ncpus,
+ buffers_per_partition, NBuffers - (num_partitions * buffers_per_partition));
+
+ /*
+ * Walk through the buffers, add them to the correct list. Walk from
+ * the end, because we're adding the buffers to the beginning.
+ */
+ for (int i = NBuffers - 1; i >= 0; i--)
+ {
+ BufferDesc *buf = GetBufferDescriptor(i);
+ BufferStrategyFreelist *freelist;
+ int node;
+ int index;
+
+ /*
+ * Split the freelist into partitions, if needed (or just keep the
+			 * freelist we already built in BufferManagerShmemInit()).
+ */
+
+ /* determine NUMA node for buffer, this determines the group */
+ node = BufferGetNode(i);
+
+ /* now calculate the actual freelist index */
+ index = node * num_partitions_per_group + (i % num_partitions_per_group);
+
+ /* add to the right freelist */
+ freelist = &StrategyControl->freelists[index];
+
+ buf->freeNext = freelist->firstFreeBuffer;
+ freelist->firstFreeBuffer = i;
+ }
}
else
Assert(!init);
diff --git a/src/backend/utils/init/globals.c b/src/backend/utils/init/globals.c
index f5359db3656..a11bc71a386 100644
--- a/src/backend/utils/init/globals.c
+++ b/src/backend/utils/init/globals.c
@@ -148,6 +148,7 @@ int MaxBackends = 0;
/* NUMA stuff */
bool numa_buffers_interleave = false;
bool numa_localalloc = false;
+bool numa_partition_freelist = false;
/* GUC parameters for vacuum */
int VacuumBufferUsageLimit = 2048;
diff --git a/src/backend/utils/misc/guc_tables.c b/src/backend/utils/misc/guc_tables.c
index a21f20800fb..0552ed62cc7 100644
--- a/src/backend/utils/misc/guc_tables.c
+++ b/src/backend/utils/misc/guc_tables.c
@@ -2136,6 +2136,16 @@ struct config_bool ConfigureNamesBool[] =
NULL, NULL, NULL
},
+ {
+ {"numa_partition_freelist", PGC_POSTMASTER, DEVELOPER_OPTIONS,
+ gettext_noop("Enables buffer freelists to be partitioned per NUMA node."),
+ gettext_noop("When enabled, we create a separate freelist per NUMA node."),
+ },
+ &numa_partition_freelist,
+ false,
+ NULL, NULL, NULL
+ },
+
{
{"sync_replication_slots", PGC_SIGHUP, REPLICATION_STANDBY,
gettext_noop("Enables a physical standby to synchronize logical failover replication slots from the primary server."),
diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h
index 692871a401f..66baf2bf33e 100644
--- a/src/include/miscadmin.h
+++ b/src/include/miscadmin.h
@@ -180,6 +180,7 @@ extern PGDLLIMPORT int max_parallel_workers;
extern PGDLLIMPORT bool numa_buffers_interleave;
extern PGDLLIMPORT bool numa_localalloc;
+extern PGDLLIMPORT bool numa_partition_freelist;
extern PGDLLIMPORT int commit_timestamp_buffers;
extern PGDLLIMPORT int multixact_member_buffers;
diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index c257c8a1c20..efb7e28c10f 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -93,6 +93,14 @@ typedef enum ExtendBufferedFlags
EB_LOCK_TARGET = (1 << 5),
} ExtendBufferedFlags;
+typedef enum FreelistPartitionMode
+{
+ FREELIST_PARTITION_NONE,
+ FREELIST_PARTITION_NODE,
+ FREELIST_PARTITION_CPU,
+ FREELIST_PARTITION_PID,
+} FreelistPartitionMode;
+
/*
* Some functions identify relations either by relation or smgr +
* relpersistence. Used via the BMR_REL()/BMR_SMGR() macros below. This
--
2.49.0
Attachment: v2-0003-freelist-Don-t-track-tail-of-a-freelist.patch
From 2faefc2d10dcd9e31e96be5565e82d1904bd7280 Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Mon, 14 Oct 2024 14:10:13 -0400
Subject: [PATCH v2 3/7] freelist: Don't track tail of a freelist
The freelist tail isn't currently used, making it unnecessary overhead.
So just don't do that.
---
src/backend/storage/buffer/freelist.c | 9 ---------
1 file changed, 9 deletions(-)
diff --git a/src/backend/storage/buffer/freelist.c b/src/backend/storage/buffer/freelist.c
index 01909be0272..e046526c149 100644
--- a/src/backend/storage/buffer/freelist.c
+++ b/src/backend/storage/buffer/freelist.c
@@ -40,12 +40,6 @@ typedef struct
pg_atomic_uint32 nextVictimBuffer;
int firstFreeBuffer; /* Head of list of unused buffers */
- int lastFreeBuffer; /* Tail of list of unused buffers */
-
- /*
- * NOTE: lastFreeBuffer is undefined when firstFreeBuffer is -1 (that is,
- * when the list is empty)
- */
/*
* Statistics. These counters should be wide enough that they can't
@@ -371,8 +365,6 @@ StrategyFreeBuffer(BufferDesc *buf)
if (buf->freeNext == FREENEXT_NOT_IN_LIST)
{
buf->freeNext = StrategyControl->firstFreeBuffer;
- if (buf->freeNext < 0)
- StrategyControl->lastFreeBuffer = buf->buf_id;
StrategyControl->firstFreeBuffer = buf->buf_id;
}
@@ -509,7 +501,6 @@ StrategyInitialize(bool init)
* assume it was previously set up by BufferManagerShmemInit().
*/
StrategyControl->firstFreeBuffer = 0;
- StrategyControl->lastFreeBuffer = NBuffers - 1;
/* Initialize the clock sweep pointer */
pg_atomic_init_u32(&StrategyControl->nextVictimBuffer, 0);
--
2.49.0
Attachment: v2-0002-NUMA-localalloc.patch
From c0acd3385fa961e56eb435b85bb021e7ce9e2cb8 Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tomas@vondra.me>
Date: Thu, 22 May 2025 18:27:06 +0200
Subject: [PATCH v2 2/7] NUMA: localalloc
Set the default allocation policy to "localalloc", which means from the
local NUMA node. This is useful for process-private memory, which is not
going to be shared with other nodes, and is relatively short-lived (so
we're unlikely to have issues if the process gets moved by the scheduler).
This sets the default for the whole process, for all future allocations. But
that's fine, we've already populated the shared memory earlier (by
interleaving it explicitly). Otherwise we'd trigger a page fault and the
page would be allocated on the local node.
XXX This patch may not be necessary, as we now place memory on nodes
using explicit numa_tonode_memory() calls, and not by interleaving. But
it's useful for experiments during development, so I'm keeping it.
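For completeness, here is a tiny standalone illustration (not PostgreSQL code,
assumes libnuma) of what the switch changes: the policy only affects pages
faulted in after the call, so memory that was already placed explicitly (e.g.
via numa_tonode_memory) keeps its node. demo_localalloc() is a made-up name.

#include <numa.h>
#include <stdlib.h>
#include <string.h>

static void
demo_localalloc(void)
{
	char	   *priv;

	if (numa_available() < 0)
		return;					/* no NUMA support on this system */

	numa_set_localalloc();		/* future faults satisfied from the local node */

	priv = malloc(1024 * 1024);	/* backend-private memory ... */
	if (priv == NULL)
		return;
	memset(priv, 0, 1024 * 1024);	/* ... faulted in on the node we run on */
	free(priv);
}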
---
src/backend/utils/init/globals.c | 1 +
src/backend/utils/init/miscinit.c | 16 ++++++++++++++++
src/backend/utils/misc/guc_tables.c | 10 ++++++++++
src/include/miscadmin.h | 1 +
4 files changed, 28 insertions(+)
diff --git a/src/backend/utils/init/globals.c b/src/backend/utils/init/globals.c
index 876cb64cf66..f5359db3656 100644
--- a/src/backend/utils/init/globals.c
+++ b/src/backend/utils/init/globals.c
@@ -147,6 +147,7 @@ int MaxBackends = 0;
/* NUMA stuff */
bool numa_buffers_interleave = false;
+bool numa_localalloc = false;
/* GUC parameters for vacuum */
int VacuumBufferUsageLimit = 2048;
diff --git a/src/backend/utils/init/miscinit.c b/src/backend/utils/init/miscinit.c
index 43b4dbccc3d..d11936691b2 100644
--- a/src/backend/utils/init/miscinit.c
+++ b/src/backend/utils/init/miscinit.c
@@ -28,6 +28,10 @@
#include <arpa/inet.h>
#include <utime.h>
+#ifdef USE_LIBNUMA
+#include <numa.h>
+#endif
+
#include "access/htup_details.h"
#include "access/parallel.h"
#include "catalog/pg_authid.h"
@@ -164,6 +168,18 @@ InitPostmasterChild(void)
(errcode_for_socket_access(),
errmsg_internal("could not set postmaster death monitoring pipe to FD_CLOEXEC mode: %m")));
#endif
+
+#ifdef USE_LIBNUMA
+ /*
+ * Set the default allocation policy to local node, where the task is
+ * executing at the time of a page fault.
+ *
+ * XXX I believe this is not necessary, now that we don't use automatic
+ * interleaving (numa_set_interleave_mask).
+ */
+ if (numa_localalloc)
+ numa_set_localalloc();
+#endif
}
/*
diff --git a/src/backend/utils/misc/guc_tables.c b/src/backend/utils/misc/guc_tables.c
index 9570087aa60..a21f20800fb 100644
--- a/src/backend/utils/misc/guc_tables.c
+++ b/src/backend/utils/misc/guc_tables.c
@@ -2126,6 +2126,16 @@ struct config_bool ConfigureNamesBool[] =
NULL, NULL, NULL
},
+ {
+ {"numa_localalloc", PGC_POSTMASTER, DEVELOPER_OPTIONS,
+ gettext_noop("Enables setting the default allocation policy to local node."),
+ gettext_noop("When enabled, allocate from the node where the task is executing."),
+ },
+ &numa_localalloc,
+ false,
+ NULL, NULL, NULL
+ },
+
{
{"sync_replication_slots", PGC_SIGHUP, REPLICATION_STANDBY,
gettext_noop("Enables a physical standby to synchronize logical failover replication slots from the primary server."),
diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h
index 014a6079af2..692871a401f 100644
--- a/src/include/miscadmin.h
+++ b/src/include/miscadmin.h
@@ -179,6 +179,7 @@ extern PGDLLIMPORT int max_worker_processes;
extern PGDLLIMPORT int max_parallel_workers;
extern PGDLLIMPORT bool numa_buffers_interleave;
+extern PGDLLIMPORT bool numa_localalloc;
extern PGDLLIMPORT int commit_timestamp_buffers;
extern PGDLLIMPORT int multixact_member_buffers;
--
2.49.0
Attachment: v2-0001-NUMA-interleaving-buffers.patch
From 1eab6285dab1fdc78d80f6054ec3278624a662f1 Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tomas@vondra.me>
Date: Tue, 6 May 2025 21:12:21 +0200
Subject: [PATCH v2 1/7] NUMA: interleaving buffers
Ensure shared buffers are allocated from all NUMA nodes, in a balanced
way, instead of just using the node where Postgres initially starts, or
where the kernel decides to migrate the page, etc. With pre-warming
performed by a single backend, this can easily result in severely
unbalanced memory distribution (with most from a single NUMA node).
The kernel would eventually move some of the memory to other nodes
(thanks to zone_reclaim), but that tends to take a long time. So this
patch improves predictability, reduces the time needed for warmup
during benchmarking, etc. It's less dependent on what the CPU
scheduler does, etc.
Furthermore, the buffers are mapped to NUMA nodes in a deterministic
way, so this also allows further improvements like backends using
buffers from the same NUMA node.
The effect is similar to
numactl --interleave=all
but there's a number of important differences.
Firstly, it's applied only to shared buffers (and also to descriptors),
not to the whole shared memory segment. It's not clear we'd want to use
interleaving for all parts, storing entries with different sizes and
life cycles (e.g. ProcArray may need different approach).
Secondly, it considers the page and block size, and makes sure not to
split a buffer on different NUMA nodes (which with the regular
interleaving is guaranteed to happen, except when using huge pages). The
patch performs "explicit" interleaving, so that buffers are not split
like this.
The patch maps both buffers and buffer descriptors, so that the buffer
and its buffer descriptor end up on the same NUMA node.
The mapping happens in larger chunks (see choose_chunk_items). This is
required to handle buffer descriptors (which are smaller than buffers),
and it should also help to reduce the number of mappings. Most NUMA
systems will use 1GB chunks, unless using very small shared buffers.
Notes:
* The feature is enabled by numa_buffers_interleave GUC (false by default)
* It's not clear we want to enable interleaving for all shared memory.
We probably want that for shared buffers, but maybe not for ProcArray
or freelists.
* Similar questions are about huge pages - in general it's a good idea,
but maybe it's not quite good for ProcArray. It's somewhat separate
from NUMA, but not entirely because NUMA works on page granularity.
PGPROC entries are ~8KB, so too large for interleaving with 4K pages,
as we don't want to split the entry to multiple nodes. But could be
done explicitly, by specifying which node to use for the pages.
* We could partition ProcArray, with one partition per NUMA node, and
then at connection time pick an entry from the local node. The process
could migrate to some other node later, especially for long-lived
connections, but there's no perfect solution. Maybe we could set
affinity to cores from the same node, or something like that?
---
src/backend/storage/buffer/buf_init.c | 384 +++++++++++++++++++++++++-
src/backend/storage/buffer/bufmgr.c | 1 +
src/backend/utils/init/globals.c | 3 +
src/backend/utils/misc/guc_tables.c | 10 +
src/bin/pgbench/pgbench.c | 67 ++---
src/include/miscadmin.h | 2 +
src/include/storage/bufmgr.h | 1 +
7 files changed, 427 insertions(+), 41 deletions(-)
diff --git a/src/backend/storage/buffer/buf_init.c b/src/backend/storage/buffer/buf_init.c
index ed1dc488a42..2ad34624c49 100644
--- a/src/backend/storage/buffer/buf_init.c
+++ b/src/backend/storage/buffer/buf_init.c
@@ -14,9 +14,17 @@
*/
#include "postgres.h"
+#ifdef USE_LIBNUMA
+#include <numa.h>
+#include <numaif.h>
+#endif
+
+#include "port/pg_numa.h"
#include "storage/aio.h"
#include "storage/buf_internals.h"
#include "storage/bufmgr.h"
+#include "storage/pg_shmem.h"
+#include "storage/proc.h"
BufferDescPadded *BufferDescriptors;
char *BufferBlocks;
@@ -25,6 +33,19 @@ WritebackContext BackendWritebackContext;
CkptSortItem *CkptBufferIds;
+static Size get_memory_page_size(void);
+static int64 choose_chunk_buffers(int NBuffers, Size mem_page_size, int num_nodes);
+static void pg_numa_interleave_memory(char *startptr, char *endptr,
+ Size mem_page_size, Size chunk_size,
+ int num_nodes);
+
+/* number of buffers allocated on the same NUMA node */
+static int64 numa_chunk_buffers = -1;
+
+/* number of NUMA nodes (as returned by numa_num_configured_nodes) */
+static int numa_nodes = -1;
+
+
/*
* Data Structures:
* buffers live in a freelist and a lookup data structure.
@@ -71,18 +92,80 @@ BufferManagerShmemInit(void)
foundDescs,
foundIOCV,
foundBufCkpt;
+ Size mem_page_size;
+ Size buffer_align;
+
+ /*
+ * XXX A bit weird. Do we need to worry about postmaster? Could this even
+ * run outside postmaster? I don't think so.
+ *
+ * XXX Another issue is we may get different values than when sizing
+ * the memory, because at that point we didn't know if we'd get huge pages,
+ * so we assumed we would. Shouldn't cause crashes, but we might allocate
+ * shared memory and then not use some of it (because of the alignment
+ * that we don't actually need). Not sure about a better way; good for now.
+ */
+ if (IsUnderPostmaster)
+ mem_page_size = pg_get_shmem_pagesize();
+ else
+ mem_page_size = get_memory_page_size();
+
+ /*
+ * With NUMA we need to ensure the buffers are properly aligned not just
+ * to PG_IO_ALIGN_SIZE, but also to memory page size, because NUMA works
+ * on page granularity, and we don't want a buffer to get split to
+ * multiple nodes (when using multiple memory pages).
+ *
+ * We also don't want to interfere with other parts of shared memory,
+ * which could easily happen with huge pages (e.g. with data stored before
+ * buffers).
+ *
+ * We do this by aligning to the larger of the two values (we know both
+ * are power-of-two values, so the larger value is automatically a
+ * multiple of the lesser one).
+ *
+ * XXX Maybe there's a way to use less alignment?
+ *
+ * XXX Maybe with (mem_page_size > PG_IO_ALIGN_SIZE), we don't need to
+ * align to mem_page_size? Especially for very large huge pages (e.g. 1GB)
+ * that doesn't seem quite worth it. Maybe we should simply align to
+ * BLCKSZ, so that buffers don't get split? Still, we might interfere with
+ * other stuff stored in shared memory that we want to allocate on a
+ * particular NUMA node (e.g. ProcArray).
+ *
+ * XXX Maybe with "too large" huge pages we should just not do this, or
+ * maybe do this only for sufficiently large areas (e.g. shared buffers,
+ * but not ProcArray).
+ */
+ buffer_align = Max(mem_page_size, PG_IO_ALIGN_SIZE);
+
+ /* one page is a multiple of the other */
+ Assert(((mem_page_size % PG_IO_ALIGN_SIZE) == 0) ||
+ ((PG_IO_ALIGN_SIZE % mem_page_size) == 0));
- /* Align descriptors to a cacheline boundary. */
+ /*
+ * Align descriptors to a cacheline boundary, and memory page.
+ *
+ * We want to distribute both to NUMA nodes, so that each buffer and its
+ * descriptor are on the same NUMA node. So we align both the same way.
+ *
+ * XXX The memory page is always larger than cacheline, so the cacheline
+ * reference is a bit unnecessary.
+ *
+ * XXX In principle we only need to do this with NUMA, otherwise we could
+ * still align just to cacheline, as before.
+ */
BufferDescriptors = (BufferDescPadded *)
- ShmemInitStruct("Buffer Descriptors",
- NBuffers * sizeof(BufferDescPadded),
- &foundDescs);
+ TYPEALIGN(buffer_align,
+ ShmemInitStruct("Buffer Descriptors",
+ NBuffers * sizeof(BufferDescPadded) + buffer_align,
+ &foundDescs));
/* Align buffer pool on IO page size boundary. */
BufferBlocks = (char *)
- TYPEALIGN(PG_IO_ALIGN_SIZE,
+ TYPEALIGN(buffer_align,
ShmemInitStruct("Buffer Blocks",
- NBuffers * (Size) BLCKSZ + PG_IO_ALIGN_SIZE,
+ NBuffers * (Size) BLCKSZ + buffer_align,
&foundBufs));
/* Align condition variables to cacheline boundary. */
@@ -112,6 +195,63 @@ BufferManagerShmemInit(void)
{
int i;
+ /*
+ * Assign chunks of buffers and buffer descriptors to the available
+ * NUMA nodes. We can't use the regular interleaving, because with
+ * regular memory pages (smaller than BLCKSZ) we'd split all buffers
+ * to multiple NUMA nodes. And we don't want that.
+ *
+ * But even with huge pages it seems like a good idea to not have
+ * mapping for each page.
+ *
+ * So we always assign a larger contiguous chunk of buffers to the
+ * same NUMA node, as calculated by choose_chunk_buffers(). We try to
+ * keep the chunks large enough to work both for buffers and buffer
+ * descriptors, but not too large. See the comments at
+ * choose_chunk_buffers() for details.
+ *
+ * Thanks to the earlier alignment (to memory page etc.), we know the
+ * buffers won't get split, etc.
+ *
+ * This also makes it easier / straightforward to calculate which NUMA
+ * node a buffer belongs to (it's a matter of divide + mod). See
+ * BufferGetNode().
+ */
+ if (numa_buffers_interleave)
+ {
+ char *startptr,
+ *endptr;
+ Size chunk_size;
+
+ numa_nodes = numa_num_configured_nodes();
+
+ numa_chunk_buffers
+ = choose_chunk_buffers(NBuffers, mem_page_size, numa_nodes);
+
+ elog(LOG, "BufferManagerShmemInit num_nodes %d chunk_buffers %ld",
+ numa_nodes, numa_chunk_buffers);
+
+ /* first map buffers */
+ startptr = BufferBlocks;
+ endptr = startptr + ((Size) NBuffers) * BLCKSZ;
+ chunk_size = (numa_chunk_buffers * BLCKSZ);
+
+ pg_numa_interleave_memory(startptr, endptr,
+ mem_page_size,
+ chunk_size,
+ numa_nodes);
+
+ /* now do the same for buffer descriptors */
+ startptr = (char *) BufferDescriptors;
+ endptr = startptr + ((Size) NBuffers) * sizeof(BufferDescPadded);
+ chunk_size = (numa_chunk_buffers * sizeof(BufferDescPadded));
+
+ pg_numa_interleave_memory(startptr, endptr,
+ mem_page_size,
+ chunk_size,
+ numa_nodes);
+ }
+
/*
* Initialize all the buffer headers.
*/
@@ -144,6 +284,11 @@ BufferManagerShmemInit(void)
GetBufferDescriptor(NBuffers - 1)->freeNext = FREENEXT_END_OF_LIST;
}
+ /*
+ * As this point we have all the buffers in a single long freelist. With
+ * freelist partitioning we rebuild them in StrategyInitialize.
+ */
+
/* Init other shared buffer-management stuff */
StrategyInitialize(!foundDescs);
@@ -152,24 +297,72 @@ BufferManagerShmemInit(void)
&backend_flush_after);
}
+/*
+ * Determine the size of memory page.
+ *
+ * XXX This is a bit tricky, because the result depends on when we call
+ * this. Before the allocation we don't know if we succeed in allocating huge
+ * pages - but we have to size everything for the chance that we will. And then
+ * if the huge pages fail (with 'huge_pages=try'), we'll use the regular memory
+ * pages. But at that point we can't adjust the sizing.
+ *
+ * XXX Maybe with huge_pages=try we should do the sizing twice - first with
+ * huge pages, and if that fails, then without them. But not for this patch.
+ * Up to this point there was no such dependency on huge pages.
+ */
+static Size
+get_memory_page_size(void)
+{
+ Size os_page_size;
+ Size huge_page_size;
+
+#ifdef WIN32
+ SYSTEM_INFO sysinfo;
+
+ GetSystemInfo(&sysinfo);
+ os_page_size = sysinfo.dwPageSize;
+#else
+ os_page_size = sysconf(_SC_PAGESIZE);
+#endif
+
+ /* assume huge pages get used, unless HUGE_PAGES_OFF */
+ if (huge_pages_status != HUGE_PAGES_OFF)
+ GetHugePageSize(&huge_page_size, NULL);
+ else
+ huge_page_size = 0;
+
+ return Max(os_page_size, huge_page_size);
+}
+
/*
* BufferManagerShmemSize
*
* compute the size of shared memory for the buffer pool including
* data pages, buffer descriptors, hash tables, etc.
+ *
+ * XXX Called before allocation, so we don't know if huge pages get used yet.
+ * So we need to assume huge pages get used, and use get_memory_page_size()
+ * to calculate the largest possible memory page.
*/
Size
BufferManagerShmemSize(void)
{
Size size = 0;
+ Size mem_page_size;
+
+ /* XXX why does IsUnderPostmaster matter? */
+ if (IsUnderPostmaster)
+ mem_page_size = pg_get_shmem_pagesize();
+ else
+ mem_page_size = get_memory_page_size();
/* size of buffer descriptors */
size = add_size(size, mul_size(NBuffers, sizeof(BufferDescPadded)));
/* to allow aligning buffer descriptors */
- size = add_size(size, PG_CACHE_LINE_SIZE);
+ size = add_size(size, Max(mem_page_size, PG_IO_ALIGN_SIZE));
/* size of data pages, plus alignment padding */
- size = add_size(size, PG_IO_ALIGN_SIZE);
+ size = add_size(size, Max(mem_page_size, PG_IO_ALIGN_SIZE));
size = add_size(size, mul_size(NBuffers, BLCKSZ));
/* size of stuff controlled by freelist.c */
@@ -186,3 +379,178 @@ BufferManagerShmemSize(void)
return size;
}
+
+/*
+ * choose_chunk_buffers
+ * choose the number of buffers allocated to a NUMA node at once
+ *
+ * We don't map shared buffers to NUMA nodes one by one, but in larger chunks.
+ * This is both for efficiency reasons (fewer mappings), and also because we
+ * want to map buffer descriptors too - and descriptors are much smaller. So
+ * we pick a number that's high enough for descriptors to use whole pages.
+ *
+ * We also want to keep buffers somewhat evenly distributed on nodes, with
+ * about NBuffers/nodes per node. So we don't use chunks larger than this,
+ * to keep it as fair as possible (the chunk size is a possible difference
+ * between memory allocated to different NUMA nodes).
+ *
+ * It's possible shared buffers are so small this is not possible (i.e.
+ * it's less than chunk_size). But sensible NUMA systems will use a lot
+ * of memory, so this is unlikely.
+ *
+ * We simply print a warning about the misbalance, and that's it.
+ *
+ * XXX It'd be good to ensure the chunk size is a power-of-2, because then
+ * we could calculate the NUMA node simply by shift/modulo, while now we
+ * have to do a division. But we don't know how many buffers and buffer
+ * descriptors fit into a memory page. It may not be a power-of-2.
+ */
+static int64
+choose_chunk_buffers(int NBuffers, Size mem_page_size, int num_nodes)
+{
+ int64 num_items;
+ int64 max_items;
+
+ /* make sure the chunks will align nicely */
+ Assert(BLCKSZ % sizeof(BufferDescPadded) == 0);
+ Assert(mem_page_size % sizeof(BufferDescPadded) == 0);
+ Assert(((BLCKSZ % mem_page_size) == 0) || ((mem_page_size % BLCKSZ) == 0));
+
+ /*
+ * The minimum number of items to fill a memory page with descriptors and
+ * blocks. NUMA allocates memory in pages, and we need to do that for
+ * both buffers and descriptors.
+ *
+ * In practice the BLCKSZ doesn't really matter, because it's much larger
+ * than BufferDescPadded, so the result is determined by the buffer descriptors.
+ * But it's clearer this way.
+ */
+ num_items = Max(mem_page_size / sizeof(BufferDescPadded),
+ mem_page_size / BLCKSZ);
+
+ /*
+ * We shouldn't use chunks larger than NBuffers/num_nodes, because with
+ * larger chunks the last NUMA node would end up with much less memory (or
+ * no memory at all).
+ */
+ max_items = (NBuffers / num_nodes);
+
+ /*
+ * Did we already exceed the maximum desirable chunk size? That is, will
+ * the last node get less than one whole chunk (or no memory at all)?
+ */
+ if (num_items > max_items)
+ elog(WARNING, "choose_chunk_buffers: chunk items exceeds max (%ld > %ld)",
+ num_items, max_items);
+
+ /* grow the chunk size until we hit the max limit. */
+ while (2 * num_items <= max_items)
+ num_items *= 2;
+
+ /*
+ * XXX It's not difficult to construct cases where we end up with not
+ * quite balanced distribution. For example, with shared_buffers=10GB and
+ * 4 NUMA nodes, we end up with 2GB chunks, which means the first node
+ * gets 4GB, and the three other nodes get 2GB each.
+ *
+ * We could be smarter, and try to get more balanced distribution. We
+ * could simply reduce max_items e.g. to
+ *
+ * max_items = (NBuffers / num_nodes) / 4;
+ *
+ * in which cases we'd end up with 512MB chunks, and each nodes would get
+ * the same 2.5GB chunk. It may not always work out this nicely, but it's
+ * better than with (NBuffers / num_nodes).
+ *
+ * Alternatively, we could "backtrack" - try with the large max_items,
+ * check how balanced it is, and if it's too imbalanced, try with a
+ * smaller one.
+ *
+ * We however want a simple scheme.
+ */
+
+ return num_items;
+}
+
+/*
+ * Calculate the NUMA node for a given buffer.
+ */
+int
+BufferGetNode(Buffer buffer)
+{
+ /* not NUMA interleaving */
+ if (numa_chunk_buffers == -1)
+ return -1;
+
+ return (buffer / numa_chunk_buffers) % numa_nodes;
+}
+
+/*
+ * pg_numa_interleave_memory
+ * move memory to different NUMA nodes in larger chunks
+ *
+ * startptr - start of the region (should be aligned to page size)
+ * endptr - end of the region (doesn't need to be aligned)
+ * mem_page_size - size of the memory page
+ * chunk_size - size of the chunk to move to a single node (should be a
+ * multiple of page size)
+ * num_nodes - number of nodes to allocate memory to
+ *
+ * XXX Maybe this should use numa_tonode_memory and numa_police_memory instead?
+ * That might be more efficient than numa_move_pages, as it works on larger
+ * chunks of memory, not individual system pages, I think.
+ *
+ * XXX The "interleave" name is not quite accurate, I guess.
+ */
+static void
+pg_numa_interleave_memory(char *startptr, char *endptr,
+ Size mem_page_size, Size chunk_size,
+ int num_nodes)
+{
+ volatile uint64 touch pg_attribute_unused();
+ char *ptr = startptr;
+
+ /* chunk size has to be a multiple of memory page */
+ Assert((chunk_size % mem_page_size) == 0);
+
+ /*
+ * Walk the memory pages in the range, and determine the node for each
+ * one. We use numa_tonode_memory(), because then we can move a whole
+ * memory range to the node, we don't need to worry about individual pages
+ * like with numa_move_pages().
+ */
+ while (ptr < endptr)
+ {
+ /* We may have an incomplete chunk at the end. */
+ Size sz = Min(chunk_size, (endptr - ptr));
+
+ /*
+ * What NUMA node does this range belong to? Each chunk should go to
+ * the same NUMA node, in a round-robin manner.
+ */
+ int node = ((ptr - startptr) / chunk_size) % num_nodes;
+
+ /*
+ * Set the NUMA node for this chunk. The chunk should start at a memory
+ * page boundary and span whole memory pages, thanks to the buffer_align
+ * earlier.
+ */
+ Assert((int64) ptr % mem_page_size == 0);
+ Assert((sz % mem_page_size) == 0);
+
+ /*
+ * XXX no return value, to make this fail on error, has to use
+ * numa_set_strict
+ *
+ * XXX Should we still touch the memory first, like with numa_move_pages,
+ * or is that not necessary?
+ */
+ numa_tonode_memory(ptr, sz, node);
+
+ ptr += sz;
+ }
+
+ /* should have processed all chunks */
+ Assert(ptr == endptr);
+}
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 94db3e7c976..5922689fe5d 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -685,6 +685,7 @@ ReadRecentBuffer(RelFileLocator rlocator, ForkNumber forkNum, BlockNumber blockN
BufferDesc *bufHdr;
BufferTag tag;
uint32 buf_state;
+
Assert(BufferIsValid(recent_buffer));
ResourceOwnerEnlarge(CurrentResourceOwner);
diff --git a/src/backend/utils/init/globals.c b/src/backend/utils/init/globals.c
index d31cb45a058..876cb64cf66 100644
--- a/src/backend/utils/init/globals.c
+++ b/src/backend/utils/init/globals.c
@@ -145,6 +145,9 @@ int max_worker_processes = 8;
int max_parallel_workers = 8;
int MaxBackends = 0;
+/* NUMA stuff */
+bool numa_buffers_interleave = false;
+
/* GUC parameters for vacuum */
int VacuumBufferUsageLimit = 2048;
diff --git a/src/backend/utils/misc/guc_tables.c b/src/backend/utils/misc/guc_tables.c
index d14b1678e7f..9570087aa60 100644
--- a/src/backend/utils/misc/guc_tables.c
+++ b/src/backend/utils/misc/guc_tables.c
@@ -2116,6 +2116,16 @@ struct config_bool ConfigureNamesBool[] =
NULL, NULL, NULL
},
+ {
+ {"numa_buffers_interleave", PGC_POSTMASTER, DEVELOPER_OPTIONS,
+ gettext_noop("Enables NUMA interleaving of shared buffers."),
+ gettext_noop("When enabled, the buffers in shared memory are interleaved to all NUMA nodes."),
+ },
+ &numa_buffers_interleave,
+ false,
+ NULL, NULL, NULL
+ },
+
{
{"sync_replication_slots", PGC_SIGHUP, REPLICATION_STANDBY,
gettext_noop("Enables a physical standby to synchronize logical failover replication slots from the primary server."),
diff --git a/src/bin/pgbench/pgbench.c b/src/bin/pgbench/pgbench.c
index 69b6a877dc9..c07de903f76 100644
--- a/src/bin/pgbench/pgbench.c
+++ b/src/bin/pgbench/pgbench.c
@@ -305,7 +305,7 @@ static const char *progname;
#define CPU_PINNING_RANDOM 1
#define CPU_PINNING_COLOCATED 2
-static int pinning_mode = CPU_PINNING_NONE;
+static int pinning_mode = CPU_PINNING_NONE;
#define WSEP '@' /* weight separator */
@@ -874,20 +874,20 @@ static bool socket_has_input(socket_set *sa, int fd, int idx);
*/
typedef struct cpu_generator_state
{
- int ncpus; /* number of CPUs available */
- int nitems; /* number of items in the queue */
- int *nthreads; /* number of threads for each CPU */
- int *nclients; /* number of processes for each CPU */
- int *items; /* queue of CPUs to pick from */
-} cpu_generator_state;
+ int ncpus; /* number of CPUs available */
+ int nitems; /* number of items in the queue */
+ int *nthreads; /* number of threads for each CPU */
+ int *nclients; /* number of processes for each CPU */
+ int *items; /* queue of CPUs to pick from */
+} cpu_generator_state;
static cpu_generator_state cpu_generator_init(int ncpus);
-static void cpu_generator_refill(cpu_generator_state *state);
-static void cpu_generator_reset(cpu_generator_state *state);
-static int cpu_generator_thread(cpu_generator_state *state);
-static int cpu_generator_client(cpu_generator_state *state, int thread_cpu);
-static void cpu_generator_print(cpu_generator_state *state);
-static bool cpu_generator_check(cpu_generator_state *state);
+static void cpu_generator_refill(cpu_generator_state * state);
+static void cpu_generator_reset(cpu_generator_state * state);
+static int cpu_generator_thread(cpu_generator_state * state);
+static int cpu_generator_client(cpu_generator_state * state, int thread_cpu);
+static void cpu_generator_print(cpu_generator_state * state);
+static bool cpu_generator_check(cpu_generator_state * state);
static void reset_pinning(TState *threads, int nthreads);
@@ -7422,7 +7422,7 @@ main(int argc, char **argv)
/* try to assign threads/clients to CPUs */
if (pinning_mode != CPU_PINNING_NONE)
{
- int nprocs = get_nprocs();
+ int nprocs = get_nprocs();
cpu_generator_state state = cpu_generator_init(nprocs);
retry:
@@ -7433,6 +7433,7 @@ retry:
for (i = 0; i < nthreads; i++)
{
TState *thread = &threads[i];
+
thread->cpu = cpu_generator_thread(&state);
}
@@ -7444,7 +7445,7 @@ retry:
while (true)
{
/* did we find any unassigned backend? */
- bool found = false;
+ bool found = false;
for (i = 0; i < nthreads; i++)
{
@@ -7678,10 +7679,10 @@ threadRun(void *arg)
/* determine PID of the backend, pin it to the same CPU */
for (int i = 0; i < nstate; i++)
{
- char *pid_str;
- pid_t pid;
+ char *pid_str;
+ pid_t pid;
- PGresult *res = PQexec(state[i].con, "select pg_backend_pid()");
+ PGresult *res = PQexec(state[i].con, "select pg_backend_pid()");
if (PQresultStatus(res) != PGRES_TUPLES_OK)
pg_fatal("could not determine PID of the backend for client %d",
@@ -8184,7 +8185,7 @@ cpu_generator_init(int ncpus)
{
struct timeval tv;
- cpu_generator_state state;
+ cpu_generator_state state;
state.ncpus = ncpus;
@@ -8207,7 +8208,7 @@ cpu_generator_init(int ncpus)
}
static void
-cpu_generator_refill(cpu_generator_state *state)
+cpu_generator_refill(cpu_generator_state * state)
{
struct timeval tv;
@@ -8223,7 +8224,7 @@ cpu_generator_refill(cpu_generator_state *state)
}
static void
-cpu_generator_reset(cpu_generator_state *state)
+cpu_generator_reset(cpu_generator_state * state)
{
state->nitems = 0;
cpu_generator_refill(state);
@@ -8236,15 +8237,15 @@ cpu_generator_reset(cpu_generator_state *state)
}
static int
-cpu_generator_thread(cpu_generator_state *state)
+cpu_generator_thread(cpu_generator_state * state)
{
if (state->nitems == 0)
cpu_generator_refill(state);
while (true)
{
- int idx = lrand48() % state->nitems;
- int cpu = state->items[idx];
+ int idx = lrand48() % state->nitems;
+ int cpu = state->items[idx];
state->items[idx] = state->items[state->nitems - 1];
state->nitems--;
@@ -8256,10 +8257,10 @@ cpu_generator_thread(cpu_generator_state *state)
}
static int
-cpu_generator_client(cpu_generator_state *state, int thread_cpu)
+cpu_generator_client(cpu_generator_state * state, int thread_cpu)
{
- int min_clients;
- bool has_valid_cpus = false;
+ int min_clients;
+ bool has_valid_cpus = false;
for (int i = 0; i < state->nitems; i++)
{
@@ -8284,8 +8285,8 @@ cpu_generator_client(cpu_generator_state *state, int thread_cpu)
while (true)
{
- int idx = lrand48() % state->nitems;
- int cpu = state->items[idx];
+ int idx = lrand48() % state->nitems;
+ int cpu = state->items[idx];
if (cpu == thread_cpu)
continue;
@@ -8303,7 +8304,7 @@ cpu_generator_client(cpu_generator_state *state, int thread_cpu)
}
static void
-cpu_generator_print(cpu_generator_state *state)
+cpu_generator_print(cpu_generator_state * state)
{
for (int i = 0; i < state->ncpus; i++)
{
@@ -8312,10 +8313,10 @@ cpu_generator_print(cpu_generator_state *state)
}
static bool
-cpu_generator_check(cpu_generator_state *state)
+cpu_generator_check(cpu_generator_state * state)
{
- int min_count = INT_MAX,
- max_count = 0;
+ int min_count = INT_MAX,
+ max_count = 0;
for (int i = 0; i < state->ncpus; i++)
{
diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h
index 1bef98471c3..014a6079af2 100644
--- a/src/include/miscadmin.h
+++ b/src/include/miscadmin.h
@@ -178,6 +178,8 @@ extern PGDLLIMPORT int MaxConnections;
extern PGDLLIMPORT int max_worker_processes;
extern PGDLLIMPORT int max_parallel_workers;
+extern PGDLLIMPORT bool numa_buffers_interleave;
+
extern PGDLLIMPORT int commit_timestamp_buffers;
extern PGDLLIMPORT int multixact_member_buffers;
extern PGDLLIMPORT int multixact_offset_buffers;
diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index 41fdc1e7693..c257c8a1c20 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -319,6 +319,7 @@ extern void EvictRelUnpinnedBuffers(Relation rel,
/* in buf_init.c */
extern void BufferManagerShmemInit(void);
extern Size BufferManagerShmemSize(void);
+extern int BufferGetNode(Buffer buffer);
/* in localbuf.c */
extern void AtProcExit_LocalBuffers(void);
--
2.49.0
On 7/4/25 20:12, Tomas Vondra wrote:
On 7/4/25 13:05, Jakub Wartak wrote:
...
8. v1-0005 2x + /* if (numa_procs_interleave) */
Ha! it's a TRAP! I've uncommented it because I wanted to try it out
without it (just by setting GUC off) , but "MyProc->sema" is NULL :2025-07-04 12:31:08.103 CEST [28754] LOG: starting PostgreSQL
19devel on x86_64-linux, compiled by gcc-12.2.0, 64-bit
[..]
2025-07-04 12:31:08.109 CEST [28754] LOG: io worker (PID 28755)
was terminated by signal 11: Segmentation fault
2025-07-04 12:31:08.109 CEST [28754] LOG: terminating any other
active server processes
2025-07-04 12:31:08.114 CEST [28754] LOG: shutting down because
"restart_after_crash" is off
2025-07-04 12:31:08.116 CEST [28754] LOG: database system is shut down
[New LWP 28755]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
Core was generated by `postgres: io worker '.
Program terminated with signal SIGSEGV, Segmentation fault.
#0 __new_sem_wait_fast (definitive_result=1, sem=sem@entry=0x0)
at ./nptl/sem_waitcommon.c:136
136 ./nptl/sem_waitcommon.c: No such file or directory.
(gdb) where
#0 __new_sem_wait_fast (definitive_result=1, sem=sem@entry=0x0)
at ./nptl/sem_waitcommon.c:136
#1 __new_sem_trywait (sem=sem@entry=0x0) at ./nptl/sem_wait.c:81
#2 0x00005561918e0cac in PGSemaphoreReset (sema=0x0) at
../src/backend/port/posix_sema.c:302
#3 0x0000556191970553 in InitAuxiliaryProcess () at
../src/backend/storage/lmgr/proc.c:992
#4 0x00005561918e51a2 in AuxiliaryProcessMainCommon () at
../src/backend/postmaster/auxprocess.c:65
#5 0x0000556191940676 in IoWorkerMain (startup_data=<optimized
out>, startup_data_len=<optimized out>) at
../src/backend/storage/aio/method_worker.c:393
#6 0x00005561918e8163 in postmaster_child_launch
(child_type=child_type@entry=B_IO_WORKER, child_slot=20086,
startup_data=startup_data@entry=0x0,
startup_data_len=startup_data_len@entry=0,
client_sock=client_sock@entry=0x0) at
../src/backend/postmaster/launch_backend.c:290
#7 0x00005561918ea09a in StartChildProcess
(type=type@entry=B_IO_WORKER) at
../src/backend/postmaster/postmaster.c:3973
#8 0x00005561918ea308 in maybe_adjust_io_workers () at
../src/backend/postmaster/postmaster.c:4404
[..]
(gdb) print *MyProc->sem
Cannot access memory at address 0x0
Yeah, good catch. I'll look into that next week.
I've been unable to reproduce this issue, but I'm not sure what settings
you actually used for this instance. Can you give me more details how to
reproduce this?
regards
--
Tomas Vondra
Hi,
On 2025-07-17 23:11:16 +0200, Tomas Vondra wrote:
Here's a v2 of the patch series, with a couple changes:
Not a deep look at the code, just a quick reply.
* I changed the freelist partitioning scheme a little bit, based on the
discussion in this thread. Instead of having a single "partition" per
NUMA node, there's not a minimum number of partitions (set to 4). So
I assume s/not/now/?
* There's now a patch partitioning clocksweep, using the same scheme as
the freelists.
Nice!
I came to the conclusion it doesn't make much sense to partition these
things differently - I can't think of a reason why that would be
advantageous, and it makes it easier to reason about.
Agreed.
The clocksweep partitioning is somewhat harder, because it affects
BgBufferSync() and related code. With the partitioning we now have
multiple "clock hands" for different ranges of buffers, and the clock
sweep needs to consider that. I modified BgBufferSync to simply loop
through the ClockSweep partitions, and do a small cleanup for each.
That probably makes sense for now. It might need a bit of a larger adjustment
at some point, but ...
* This new freelist/clocksweep partitioning scheme is however harder to
disable. I now realize the GUC may not quite do the trick, and there isn't
even a GUC for the clocksweep. I need to think about this, but I'm not even
sure how feasible it'd be to have two separate GUCs (because of how
these two pieces are intertwined). For now if you want to test without
the partitioning, you need to skip the patch.
I think it's totally fair to enable/disable them at the same time. They're so
closely related, that I don't think it really makes sense to measure them
separately.
I did some quick perf testing on my old xeon machine (2 NUMA nodes), and
the results are encouraging. For a read-only pgbench (2x shared buffers,
within RAM), I saw an increase from 1.1M tps to 1.3M. Not crazy, but not
bad considering the patch is more about consistency than raw throughput.
Personally I think a 1.18x improvement on a relatively small NUMA machine is
really rather awesome.
For a read-write pgbench I however saw some strange drops/increases of
throughput. I suspect this might be due to some thinko in the clocksweep
partitioning, but I'll need to take a closer look.
Was that with pinning etc enabled or not?
From c4d51ab87b92f9900e37d42cf74980e87b648a56 Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tomas@vondra.me>
Date: Sun, 8 Jun 2025 18:53:12 +0200
Subject: [PATCH v2 5/7] NUMA: clockweep partitioning
@@ -475,13 +525,17 @@ StrategyGetBuffer(BufferAccessStrategy strategy, uint32 *buf_state, bool *from_r
 /*
 * Nothing on the freelist, so run the "clock sweep" algorithm
 *
- * XXX Should we also make this NUMA-aware, to only access buffers from
- * the same NUMA node? That'd probably mean we need to make the clock
- * sweep NUMA-aware, perhaps by having multiple clock sweeps, each for a
- * subset of buffers. But that also means each process could "sweep" only
- * a fraction of buffers, even if the other buffers are better candidates
- * for eviction. Would that also mean we'd have multiple bgwriters, one
- * for each node, or would one bgwriter handle all of that?
+ * XXX Note that ClockSweepTick() is NUMA-aware, i.e. it only looks at
+ * buffers from a single partition, aligned with the NUMA node. That
+ * means it only accesses buffers from the same NUMA node.
+ *
+ * XXX That also means each process "sweeps" only a fraction of buffers,
+ * even if the other buffers are better candidates for eviction. Maybe
+ * there should be some logic to "steal" buffers from other freelists
+ * or other nodes?
I think we *definitely* need "stealing" from other clock sweeps, whenever
there's a meaningful imbalance between the different sweeps.
I don't think we need to be overly precise about it, a small imbalance won't
have that much of an effect. But clearly it doesn't make sense to say that one
backend can only fill buffers in the current partition, that'd lead to massive
performance issues in a lot of workloads.
The hardest thing probably is to make the logic for when to check foreign
clock sweeps cheap enough.
One way would be to do it whenever a sweep wraps around, that'd probably
amortize the cost sufficiently, and I don't think it'd be too imprecise, as
we'd have processed that set of buffers in a row without partitioning as
well. But it'd probably be too coarse when determining for how long to use a
foreign sweep instance. But we probably could address that by rechecking the
balance more frequently when using a foreign partition.
Another way would be to have bgwriter manage this. Whenever it detects that
one ring is too far ahead, it could set a "avoid this partition" bit, which
would trigger backends that natively use that partition to switch to foreign
partitions that don't currently have that bit set. I suspect there's a
problem with that approach though, I worry that the amount of time that
bgwriter spends in BgBufferSync() may sometimes be too long, leading to too
much imbalance.
Greetings,
Andres Freund
On 7/18/25 18:46, Andres Freund wrote:
Hi,
On 2025-07-17 23:11:16 +0200, Tomas Vondra wrote:
Here's a v2 of the patch series, with a couple changes:
Not a deep look at the code, just a quick reply.
* I changed the freelist partitioning scheme a little bit, based on the
discussion in this thread. Instead of having a single "partition" per
NUMA node, there's not a minimum number of partitions (set to 4). So
I assume s/not/now/?
Yes.
* There's now a patch partitioning clocksweep, using the same scheme as
the freelists.
Nice!
I came to the conclusion it doesn't make much sense to partition these
things differently - I can't think of a reason why that would be
advantageous, and it makes it easier to reason about.
Agreed.
The clocksweep partitioning is somewhat harder, because it affects
BgBufferSync() and related code. With the partitioning we now have
multiple "clock hands" for different ranges of buffers, and the clock
sweep needs to consider that. I modified BgBufferSync to simply loop
through the ClockSweep partitions, and do a small cleanup for each.
That probably makes sense for now. It might need a bit of a larger
adjustment at some point, but ...
I couldn't think of something fundamentally better and not too complex.
I suspect we might want to use multiple bgwriters in the future, and
this scheme seems to be reasonably well suited for that too.
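For illustration, here's a toy sketch of the loop shape being discussed (not
the actual patch code - the SweepPartition struct and the function names are
invented for the example, and the per-buffer cleanup is just a placeholder):

/*
 * Toy sketch of a bgwriter pass over per-partition clock-sweep state.
 * Not the patch code; all names here are made up.
 */
typedef struct SweepPartition
{
    int     first_buffer;       /* first buffer id in this partition */
    int     num_buffers;        /* number of buffers in this partition */
    int     next_victim;        /* clock hand, relative to first_buffer */
    long    complete_passes;    /* how many times the hand wrapped */
} SweepPartition;

static void
sweep_partition_cleanup(SweepPartition *p, int max_scan)
{
    for (int scanned = 0; scanned < max_scan; scanned++)
    {
        int     buf = p->first_buffer + p->next_victim;

        /* ... inspect / write out buffer 'buf' here ... */
        (void) buf;

        if (++p->next_victim >= p->num_buffers)
        {
            p->next_victim = 0;
            p->complete_passes++;
        }
    }
}

static void
bgwriter_sync_all_partitions(SweepPartition *parts, int nparts, int max_scan)
{
    /* a small cleanup for each partition, using its own clock hand */
    for (int i = 0; i < nparts; i++)
        sweep_partition_cleanup(&parts[i], max_scan);
}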
I'm also thinking about having some sort of "unified" partitioning
scheme for all the places partitioning shared buffers. Right now each of
the places does it on its own, i.e. buff_init, freelist and clocksweep
all have their code splitting NBuffers into partitions. And it should
align. Because what would be the benefit if it didn't? But I guess
having three variants of the same code seems a bit pointless.
I think buff_init should build a common definition of buffer partitions,
and the remaining parts should use that as the source of truth ...
* This new freelist/clocksweep partitioning scheme is however harder to
disable. I now realize the GUC may not quite do the trick, and there isn't
even a GUC for the clocksweep. I need to think about this, but I'm not even
sure how feasible it'd be to have two separate GUCs (because of how
these two pieces are intertwined). For now if you want to test without
the partitioning, you need to skip the patch.
I think it's totally fair to enable/disable them at the same time. They're so
closely related, that I don't think it really makes sense to measure them
separately.
Yeah, that's a fair point.
I did some quick perf testing on my old xeon machine (2 NUMA nodes), and
the results are encouraging. For a read-only pgbench (2x shared buffers,
within RAM), I saw an increase from 1.1M tps to 1.3M. Not crazy, but not
bad considering the patch is more about consistency than raw throughput.
Personally I think a 1.18x improvement on a relatively small NUMA machine is
really rather awesome.
True, but I want to stress it's just one quick (& simple) test. Much
more testing is needed before I can make reliable claims.
For a read-write pgbench I however saw some strange drops/increases of
throughput. I suspect this might be due to some thinko in the clocksweep
partitioning, but I'll need to take a closer look.
Was that with pinning etc enabled or not?
IIRC it was with everything enabled, except for numa_procs_pin (which
pins backend to NUMA node). I found that to actually harm performance in
some of the tests (even just read-only ones), resulting in uneven usage
of cores and lower throughput.
From c4d51ab87b92f9900e37d42cf74980e87b648a56 Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tomas@vondra.me>
Date: Sun, 8 Jun 2025 18:53:12 +0200
Subject: [PATCH v2 5/7] NUMA: clockweep partitioning
@@ -475,13 +525,17 @@ StrategyGetBuffer(BufferAccessStrategy strategy, uint32 *buf_state, bool *from_r
 /*
 * Nothing on the freelist, so run the "clock sweep" algorithm
 *
- * XXX Should we also make this NUMA-aware, to only access buffers from
- * the same NUMA node? That'd probably mean we need to make the clock
- * sweep NUMA-aware, perhaps by having multiple clock sweeps, each for a
- * subset of buffers. But that also means each process could "sweep" only
- * a fraction of buffers, even if the other buffers are better candidates
- * for eviction. Would that also mean we'd have multiple bgwriters, one
- * for each node, or would one bgwriter handle all of that?
+ * XXX Note that ClockSweepTick() is NUMA-aware, i.e. it only looks at
+ * buffers from a single partition, aligned with the NUMA node. That
+ * means it only accesses buffers from the same NUMA node.
+ *
+ * XXX That also means each process "sweeps" only a fraction of buffers,
+ * even if the other buffers are better candidates for eviction. Maybe
+ * there should be some logic to "steal" buffers from other freelists
+ * or other nodes?
I think we *definitely* need "stealing" from other clock sweeps, whenever
there's a meaningful imbalance between the different sweeps.
I don't think we need to be overly precise about it, a small imbalance won't
have that much of an effect. But clearly it doesn't make sense to say that one
backend can only fill buffers in the current partition, that'd lead to massive
performance issues in a lot of workloads.
Agreed.
The hardest thing probably is to make the logic for when to check foreign
clock sweeps cheap enough.
One way would be to do it whenever a sweep wraps around, that'd probably
amortize the cost sufficiently, and I don't think it'd be too imprecise, as
we'd have processed that set of buffers in a row without partitioning as
well. But it'd probably be too coarse when determining for how long to use a
foreign sweep instance. But we probably could address that by rechecking the
balance more frequently when using a foreign partition.
What do you mean by "it"? What would happen after a sweep wraps around?
Another way would be to have bgwriter manage this. Whenever it detects that
one ring is too far ahead, it could set a "avoid this partition" bit, which
would trigger backends that natively use that partition to switch to foreign
partitions that don't currently have that bit set. I suspect there's a
problem with that approach though, I worry that the amount of time that
bgwriter spends in BgBufferSync() may sometimes be too long, leading to too
much imbalance.
I'm afraid having hard "avoid" flags would lead to sudden and unexpected
changes in performance as we enable/disable partitions. I think a good
solution should "smooth it out" somehow, e.g. by not having a true/false
flag, but having some sort of "preference" factor with values between
(0.0, 1.0) which says how much we should use that partition.
I was imagining something like this:
Say we know the number of buffers allocated for each partition (in the
last round), and we (or rather the BgBufferSync) calculate:
coefficient = 1.0 - (nallocated_partition / nallocated)
and then use that to "correct" which partition to allocate buffers from.
Or maybe just watch how far from the "fair share" we were in the last
interval, and gradually increase/decrease the "partition preference"
which would say how often we need to "steal" from other partitions.
E.g. we find nallocated_partition is 2x the fair share, i.e.
nallocated_partition / (nallocated / nparts) = 2.0
Then we say 25% of the time look at some other partition, to "cut" the
imbalance in half. And then repeat that in the next cycle, etc.
So a process would look at it's "home partition" by default, but it's
"roll a dice" first and if above the calculated probability it'd pick
some other partition instead (this would need to be done so that it gets
balanced overall).
If the bgwriter interval is too long, maybe the recalculation could be
triggered regularly after any of the clocksweeps wraps around, or after
some number of allocations, or something like that.
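To make the "roll a dice" idea above concrete, here's a toy sketch (not patch
code; it assumes some component, e.g. bgwriter, publishes per-partition
allocation counts for the last interval, and all names here are invented):

/*
 * Toy sketch of the preference-based partition choice. A partition at 2x
 * the fair share redirects 25% of its allocations elsewhere, cutting the
 * excess over the fair share roughly in half per interval.
 */
#include <stdlib.h>

/* probability of using a foreign partition, aiming to halve the excess */
static double
steal_probability(long nalloc_home, long nalloc_total, int nparts)
{
    double  fair = (double) nalloc_total / nparts;
    double  ratio;

    if (fair <= 0 || nalloc_home <= fair)
        return 0.0;             /* at or below fair share: stay home */

    ratio = nalloc_home / fair; /* e.g. 2.0 means 2x the fair share */

    /* 2x fair share => 25% of allocations go to other partitions */
    return (ratio - 1.0) / (2.0 * ratio);
}

/* pick the partition to allocate from, given last-interval counters */
static int
choose_partition(int home, const long *nalloc, long total, int nparts)
{
    double  p = steal_probability(nalloc[home], total, nparts);

    if ((double) rand() / RAND_MAX >= p)
        return home;            /* most of the time, use the home partition */

    /* otherwise pick some other partition, preferring ones at/below fair share */
    for (int tries = 0; tries < 2 * nparts; tries++)
    {
        int     cand = rand() % nparts;

        if (cand != home && nalloc[cand] * nparts <= total)
            return cand;
    }
    return home;                /* nothing suitable found, fall back */
}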
regards
--
Tomas Vondra
Hi,
On 2025-07-18 22:48:00 +0200, Tomas Vondra wrote:
On 7/18/25 18:46, Andres Freund wrote:
For a read-write pgbench I however saw some strange drops/increases of
throughput. I suspect this might be due to some thinko in the clocksweep
partitioning, but I'll need to take a closer look.
Was that with pinning etc enabled or not?
IIRC it was with everything enabled, except for numa_procs_pin (which
pins backend to NUMA node). I found that to actually harm performance in
some of the tests (even just read-only ones), resulting in uneven usage
of cores and lower throughput.
FWIW, I really doubt that something like numa_procs_pin is viable outside of
very narrow niches until we have a *lot* more infrastructure in place. Like PG
would need to be threaded, we'd need a separation between thread and
connection and an executor that'd allow us to switch from working on one query
to working on another query.
The hardest thing probably is to make the logic for when to check foreign
clock sweeps cheap enough.
One way would be to do it whenever a sweep wraps around, that'd probably
amortize the cost sufficiently, and I don't think it'd be too imprecise, as
we'd have processed that set of buffers in a row without partitioning as
well. But it'd probably be too coarse when determining for how long to use a
foreign sweep instance. But we probably could address that by rechecking the
balance more frequently when using a foreign partition.
What do you mean by "it"?
it := Considering switching back from using a "foreign" clock sweep instance
whenever the sweep wraps around.
What would happen after a sweep wraps around?
The scenario I'm worried about is this:
1) a bunch of backends read buffers on numa node A, using the local clock
sweep instance
2) due to all of that activity, the clock sweep advances much faster than the
clock sweep for numa node B
3) the clock sweep on A wraps around, we discover the imbalance, and all the
backends switch to scanning on numa node B, moving that clock sweep ahead
much more aggressively
4) clock sweep on B wraps around, there's imbalance the other way round now,
so they all switch back to A
Another way would be to have bgwriter manage this. Whenever it detects that
one ring is too far ahead, it could set a "avoid this partition" bit, which
would trigger backends that natively use that partition to switch to foreign
partitions that don't currently have that bit set. I suspect there's a
problem with that approach though, I worry that the amount of time that
bgwriter spends in BgBufferSync() may sometimes be too long, leading to too
much imbalance.I'm afraid having hard "avoid" flags would lead to sudden and unexpected
changes in performance as we enable/disable partitions. I think a good
solution should "smooth it out" somehow, e.g. by not having a true/false
flag, but having some sort of "preference" factor with values between
(0.0, 1.0) which says how much we should use that partition.
Yea, I think that's a fair worry.
I was imagining something like this:
Say we know the number of buffers allocated for each partition (in the
last round), and we (or rather the BgBufferSync) calculate:
coefficient = 1.0 - (nallocated_partition / nallocated)
and then use that to "correct" which partition to allocate buffers from.
Or maybe just watch how far from the "fair share" we were in the last
interval, and gradually increase/decrease the "partition preference"
which would say how often we need to "steal" from other partitions.E.g. we find nallocated_partition is 2x the fair share, i.e.
nallocated_partition / (nallocated / nparts) = 2.0
Then we say 25% of the time look at some other partition, to "cut" the
imbalance in half. And then repeat that in the next cycle, etc.
So a process would look at its "home partition" by default, but it'd
"roll a dice" first and if above the calculated probability it'd pick
some other partition instead (this would need to be done so that it gets
balanced overall).
That does sound reasonable.
If the bgwriter interval is too long, maybe the recalculation could be
triggered regularly after any of the clocksweeps wraps around, or after
some number of allocations, or something like that.
I'm pretty sure the bgwriter might not run often enough, or predictably
enough, for that.
Greetings,
Andres Freund
On Thu, Jul 17, 2025 at 11:15 PM Tomas Vondra <tomas@vondra.me> wrote:
On 7/4/25 20:12, Tomas Vondra wrote:
On 7/4/25 13:05, Jakub Wartak wrote:
...
8. v1-0005 2x + /* if (numa_procs_interleave) */
Ha! it's a TRAP! I've uncommented it because I wanted to try it out
without it (just by setting GUC off) , but "MyProc->sema" is NULL :2025-07-04 12:31:08.103 CEST [28754] LOG: starting PostgreSQL
19devel on x86_64-linux, compiled by gcc-12.2.0, 64-bit
[..]
2025-07-04 12:31:08.109 CEST [28754] LOG: io worker (PID 28755)
was terminated by signal 11: Segmentation fault
2025-07-04 12:31:08.109 CEST [28754] LOG: terminating any other
active server processes
2025-07-04 12:31:08.114 CEST [28754] LOG: shutting down because
"restart_after_crash" is off
2025-07-04 12:31:08.116 CEST [28754] LOG: database system is shut down
[New LWP 28755]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
Core was generated by `postgres: io worker '.
Program terminated with signal SIGSEGV, Segmentation fault.
#0 __new_sem_wait_fast (definitive_result=1, sem=sem@entry=0x0)
at ./nptl/sem_waitcommon.c:136
136 ./nptl/sem_waitcommon.c: No such file or directory.
(gdb) where
#0 __new_sem_wait_fast (definitive_result=1, sem=sem@entry=0x0)
at ./nptl/sem_waitcommon.c:136
#1 __new_sem_trywait (sem=sem@entry=0x0) at ./nptl/sem_wait.c:81
#2 0x00005561918e0cac in PGSemaphoreReset (sema=0x0) at
../src/backend/port/posix_sema.c:302
#3 0x0000556191970553 in InitAuxiliaryProcess () at
../src/backend/storage/lmgr/proc.c:992
#4 0x00005561918e51a2 in AuxiliaryProcessMainCommon () at
../src/backend/postmaster/auxprocess.c:65
#5 0x0000556191940676 in IoWorkerMain (startup_data=<optimized
out>, startup_data_len=<optimized out>) at
../src/backend/storage/aio/method_worker.c:393
#6 0x00005561918e8163 in postmaster_child_launch
(child_type=child_type@entry=B_IO_WORKER, child_slot=20086,
startup_data=startup_data@entry=0x0,
startup_data_len=startup_data_len@entry=0,
client_sock=client_sock@entry=0x0) at
../src/backend/postmaster/launch_backend.c:290
#7 0x00005561918ea09a in StartChildProcess
(type=type@entry=B_IO_WORKER) at
../src/backend/postmaster/postmaster.c:3973
#8 0x00005561918ea308 in maybe_adjust_io_workers () at
../src/backend/postmaster/postmaster.c:4404
[..]
(gdb) print *MyProc->sem
Cannot access memory at address 0x0
Yeah, good catch. I'll look into that next week.
I've been unable to reproduce this issue, but I'm not sure what settings
you actually used for this instance. Can you give me more details how to
reproduce this?
Better late than never, well feel free to partially ignore me, I've
missed that it is a known issue as per the FIXME there, but I would just rip
out that commented out `if(numa_proc_interleave)` from
FastPathLockShmemSize() and PGProcShmemSize() unless you want to save
those memory pages of course (in case of no-NUMA). If you do want to
save those pages I think we have a problem:
For complete picture, steps:
1. patch -p1 < v2-0001-NUMA-interleaving-buffers.patch
2. patch -p1 < v2-0006-NUMA-interleave-PGPROC-entries.patch
BTW the pgbench accidental indent is still there (part of the v2-0001 patch)
14 out of 14 hunks FAILED -- saving rejects to file
src/bin/pgbench/pgbench.c.rej
3. As I'm just applying 0001 and 0006, I've got two simple rejects,
but fixed it (due to not applying missing numa_ freelist patches).
That's intentional on my part, because I wanted to play just with
those two.
4. Then I uncomment those two "if (numa_procs_interleave)" related for
optional memory shm initialization - add_size() and so on (that have
XXX comment above that it is causing bootstrap issues)
5. initdb with numa_procs_interleave=on, huge_pages = on (!), start, it is ok
6. restart with numa_procs_interleave=off, which gets me every bg
worker crashing, e.g.:
(gdb) where
#0 __new_sem_wait_fast (definitive_result=1, sem=sem@entry=0x0) at
./nptl/sem_waitcommon.c:136
#1 __new_sem_trywait (sem=sem@entry=0x0) at ./nptl/sem_wait.c:81
#2 0x0000563e2d6e4d5c in PGSemaphoreReset (sema=0x0) at
../src/backend/port/posix_sema.c:302
#3 0x0000563e2d774d93 in InitAuxiliaryProcess () at
../src/backend/storage/lmgr/proc.c:995
#4 0x0000563e2d6e9252 in AuxiliaryProcessMainCommon () at
../src/backend/postmaster/auxprocess.c:65
#5 0x0000563e2d6eb683 in CheckpointerMain (startup_data=<optimized
out>, startup_data_len=<optimized out>) at
../src/backend/postmaster/checkpointer.c:190
#6 0x0000563e2d6ec363 in postmaster_child_launch
(child_type=child_type@entry=B_CHECKPOINTER, child_slot=249,
startup_data=startup_data@entry=0x0,
startup_data_len=startup_data_len@entry=0,
client_sock=client_sock@entry=0x0) at
../src/backend/postmaster/launch_backend.c:290
#7 0x0000563e2d6ee29a in StartChildProcess
(type=type@entry=B_CHECKPOINTER) at
../src/backend/postmaster/postmaster.c:3973
#8 0x0000563e2d6f17a6 in PostmasterMain (argc=argc@entry=3,
argv=argv@entry=0x563e377cc0e0) at
../src/backend/postmaster/postmaster.c:1386
#9 0x0000563e2d4948fc in main (argc=3, argv=0x563e377cc0e0) at
../src/backend/main/main.c:231
notice sema=0x0, because:
#3 0x000056050928cd93 in InitAuxiliaryProcess () at
../src/backend/storage/lmgr/proc.c:995
995 PGSemaphoreReset(MyProc->sem);
(gdb) print MyProc
$1 = (PGPROC *) 0x7f09a0c013b0
(gdb) print MyProc->sem
$2 = (PGSemaphore) 0x0
or with printfs:
2025-07-25 11:17:23.683 CEST [21772] LOG: in InitProcGlobal
PGPROC=0x7f9de827b880 requestSize=148770
// after proc && ptr manipulation:
2025-07-25 11:17:23.683 CEST [21772] LOG: in InitProcGlobal
PGPROC=0x7f9de827bdf0 requestSize=148770 procs=0x7f9de827b880
ptr=0x7f9de827bdf0
[..initialization of aux PGPROCs i=0.., still from InitProcGlobal(),
each gets a proper sem allocated as one would expect:]
[..for i loop:]
2025-07-25 11:17:23.689 CEST [21772] LOG: i=136 ,
proc=0x7f9de8600000, proc->sem=0x7f9da4e04438
2025-07-25 11:17:23.689 CEST [21772] LOG: i=137 ,
proc=0x7f9de8600348, proc->sem=0x7f9da4e044b8
2025-07-25 11:17:23.689 CEST [21772] LOG: i=138 ,
proc=0x7f9de8600690, proc->sem=0x7f9da4e04538
[..but then in the children codepaths, out of the blue in
InitAuxiliaryProcess the whole MyProc looks like it was memset to
zeros:]
2025-07-25 11:17:23.693 CEST [21784] LOG: auxiliary process using
MyProc=0x7f9de8600000 auxproc=0x7f9de8600000 proctype=0
MyProcPid=21784 MyProc->sem=(nil)
above got pgproc slot i=136 with addr 0x7f9de8600000 and later that
auxiliary is launched but somehow something NULLified ->sem there
(according to gdb, everything is zero there)
7. Original patch v2-0006 (with commented out 2x if
numa_procs_interleave), behaves OK, so in my case here with 1x NUMA
node that gives add_size(.., 1+1 * 2MB)=4MB
2025-07-25 11:38:54.131 CEST [23939] LOG: in InitProcGlobal
PGPROC=0x7f25cbe7b880 requestSize=4343074
2025-07-25 11:38:54.132 CEST [23939] LOG: in InitProcGlobal
PGPROC=0x7f25cbe7bdf0 requestSize=4343074 procs=0x7f25cbe7b880
ptr=0x7f25cbe7bdf0
so something is zeroing out all those MyProc structures apparently on
startup (probably due to some wrong alignment somewhere, maybe?). I was
thinking about trapping via mprotect() this single i=136
0x7f9de8600000 PGPROC to see what is resetting it, but oh well,
mprotect() works only on whole pages...
-J.
On 7/25/25 12:27, Jakub Wartak wrote:
On Thu, Jul 17, 2025 at 11:15 PM Tomas Vondra <tomas@vondra.me> wrote:
On 7/4/25 20:12, Tomas Vondra wrote:
On 7/4/25 13:05, Jakub Wartak wrote:
...
8. v1-0005 2x + /* if (numa_procs_interleave) */
Ha! it's a TRAP! I've uncommented it because I wanted to try it out
without it (just by setting GUC off) , but "MyProc->sema" is NULL :2025-07-04 12:31:08.103 CEST [28754] LOG: starting PostgreSQL
19devel on x86_64-linux, compiled by gcc-12.2.0, 64-bit
[..]
2025-07-04 12:31:08.109 CEST [28754] LOG: io worker (PID 28755)
was terminated by signal 11: Segmentation fault
2025-07-04 12:31:08.109 CEST [28754] LOG: terminating any other
active server processes
2025-07-04 12:31:08.114 CEST [28754] LOG: shutting down because
"restart_after_crash" is off
2025-07-04 12:31:08.116 CEST [28754] LOG: database system is shut down
[New LWP 28755]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
Core was generated by `postgres: io worker '.
Program terminated with signal SIGSEGV, Segmentation fault.
#0 __new_sem_wait_fast (definitive_result=1, sem=sem@entry=0x0)
at ./nptl/sem_waitcommon.c:136
136 ./nptl/sem_waitcommon.c: No such file or directory.
(gdb) where
#0 __new_sem_wait_fast (definitive_result=1, sem=sem@entry=0x0)
at ./nptl/sem_waitcommon.c:136
#1 __new_sem_trywait (sem=sem@entry=0x0) at ./nptl/sem_wait.c:81
#2 0x00005561918e0cac in PGSemaphoreReset (sema=0x0) at
../src/backend/port/posix_sema.c:302
#3 0x0000556191970553 in InitAuxiliaryProcess () at
../src/backend/storage/lmgr/proc.c:992
#4 0x00005561918e51a2 in AuxiliaryProcessMainCommon () at
../src/backend/postmaster/auxprocess.c:65
#5 0x0000556191940676 in IoWorkerMain (startup_data=<optimized
out>, startup_data_len=<optimized out>) at
../src/backend/storage/aio/method_worker.c:393
#6 0x00005561918e8163 in postmaster_child_launch
(child_type=child_type@entry=B_IO_WORKER, child_slot=20086,
startup_data=startup_data@entry=0x0,
startup_data_len=startup_data_len@entry=0,
client_sock=client_sock@entry=0x0) at
../src/backend/postmaster/launch_backend.c:290
#7 0x00005561918ea09a in StartChildProcess
(type=type@entry=B_IO_WORKER) at
../src/backend/postmaster/postmaster.c:3973
#8 0x00005561918ea308 in maybe_adjust_io_workers () at
../src/backend/postmaster/postmaster.c:4404
[..]
(gdb) print *MyProc->sem
Cannot access memory at address 0x0
Yeah, good catch. I'll look into that next week.
I've been unable to reproduce this issue, but I'm not sure what settings
you actually used for this instance. Can you give me more details how to
reproduce this?
Better late than never, well feel free to partially ignore me, I've
missed that it is a known issue as per the FIXME there, but I would just rip
out that commented out `if(numa_proc_interleave)` from
FastPathLockShmemSize() and PGProcShmemSize() unless you want to save
those memory pages of course (in case of no-NUMA). If you do want to
save those pages I think we have a problem:
For complete picture, steps:
1. patch -p1 < v2-0001-NUMA-interleaving-buffers.patch
2. patch -p1 < v2-0006-NUMA-interleave-PGPROC-entries.patch
BTW the pgbench accidental indent is still there (part of the v2-0001 patch)
14 out of 14 hunks FAILED -- saving rejects to file
src/bin/pgbench/pgbench.c.rej
3. As I'm just applying 0001 and 0006, I've got two simple rejects,
but fixed it (due to not applying missing numa_ freelist patches).
That's intentional on my part, because I wanted to play just with
those two.4. Then I uncomment those two "if (numa_procs_interleave)" related for
optional memory shm initialization - add_size() and so on (that have
XXX comment above that it is causing bootstrap issues)
Ah, I didn't realize you uncommented these "if" conditions. In that case
the crash is not very surprising, because the actual initialization in
InitProcGlobal ignores the GUCs and just assumes it's enabled. But
without the extra padding that likely messes up something. Or something
allocated later "overwrites" the some of the memory.
I need to clean this up, to actually consider the GUC properly.
FWIW I do have a new patch version that I plan to share in a day or two,
once I get some numbers. It didn't change this particular part, though,
it's more about the buffers/freelists/clocksweep. I'll work on PGPROC
next, I think.
regards
--
Tomas Vondra
Hi,
Here's a somewhat cleaned up v3 of this patch series, with various
improvements and a lot of cleanup. Still WIP, but I hope it resolves the
various crashes reported for v2, but it still requires --with-libnuma
(it won't build without it).
I'm aware there's an ongoing discussion about removing the freelists,
and changing the clocksweep in some way. If that happens, the relevant
parts of this series will need some adjustment, of course. I haven't
looked into that yet, I plan to review those patches soon.
main changes in v3
------------------
1) I've introduced "registry" of the buffer partitions (imagine a small
array of structs), serving as a source of truth for places that need
info about the partitions (range of buffers, ...).
With v2 there was no "shared definition" - the shared buffers, freelist
and clocksweep did their own thing. But per the discussion it doesn't
really make much sense to partition buffers in different ways.
So in v3 the 0001 patch defines the partitions, records them in shared
memory (in a small array), and the later parts just reuse this.
I also added a pg_buffercache_partitions() listing the partitions, with
first/last buffer, etc. The freelist/clocksweep patches add additional
information.
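Just to show how later patches consume the registry, here's a minimal
sketch against the BufferPartitionCount() / BufferPartitionGet() API from
0001 (the function name and the loop body are just placeholders, not
actual code from the series):

    static void
    init_per_partition_state(void)   /* hypothetical caller */
    {
        int     nparts = BufferPartitionCount();

        for (int i = 0; i < nparts; i++)
        {
            int     node,
                    num_buffers,
                    first_buffer,
                    last_buffer;

            BufferPartitionGet(i, &node, &num_buffers,
                               &first_buffer, &last_buffer);

            /*
             * Build per-partition state (freelist, clocksweep, ...) for
             * buffers [first_buffer, last_buffer] on NUMA node "node".
             */
        }
    }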
2) The PGPROC part introduces a similar registry, even though there are
no other patches building on this. But it seemed useful to have a clear
place recording this info.
There's also a view pg_buffercache_pgproc. The pg_buffercache location
is a bit bogus - it has nothing to do with buffers, but it was good
enough for now.
3) The PGPROC partitioning is reworked and should fix the crash with the
GUC set to "off".
4) This still doesn't do anything about "balancing" the clocksweep. I
have some ideas how to do that, I'll work on that next.
simple benchmark
----------------
I did a simple benchmark, measuring pgbench throughput with scale still
fitting into RAM, but much larger (~2x) than shared buffers. See the
attached test script, testing builds with more and more of the patches.
I'm attaching results from two different machines (the "usual" 2P xeon
and also a much larger cloud instance with EPYC/Genoa) - both the raw
CSV files, with average tps and percentiles, and PDFs. The PDFs also
have a comparison either to the "preceding" build (right side), or to
master (below the table).
There are results for the three "pgbench pinning" strategies, and that can
have pretty significant impact (colocated generally performs much better
than either "none" or "random").
For the "bigger" machine (wiuth 176 cores) the incremental results look
like this (for pinning=none, i.e. regular pgbench):
mode s_b buffers localal no-tail freelist sweep pgproc pinning
====================================================================
prepared 16GB 99% 101% 100% 103% 111% 99% 102%
32GB 98% 102% 99% 103% 107% 101% 112%
8GB 97% 102% 100% 102% 101% 101% 106%
--------------------------------------------------------------------
simple 16GB 100% 100% 99% 105% 108% 99% 108%
32GB 98% 101% 100% 103% 100% 101% 97%
8GB 100% 100% 101% 99% 100% 104% 104%
The way I read this is that the first three patches have about no impact
on throughput. Then freelist partitioning and (especially) clocksweep
partitioning can help quite a bit. pgproc is again close to ~0%, and
PGPROC pinning can help again (but this part is merely experimental).
For the xeon the differences (in either direction) are much smaller, so
I'm not going to post it here. It's in the PDF, though.
I think this looks reasonable. The way I see this patch series is not
about improving peak throughput, but more about reducing imbalance and
making the behavior more consistent.
The results are more a confirmation that there's not some sort of massive
overhead somewhere. But I'll get to this in a minute.
To quantify this kind of improvement, I think we'll need tests that
intentionally cause (or try to) imbalance. If you have ideas for such
tests, let me know.
overhead of partitioning calculation
------------------------------------
Regarding the "overhead", while the results look mostly OK, I think
we'll need to rethink the partitioning scheme - particularly how the
partition size is calculated. The current scheme has to use %, which can
be somewhat expensive.
The 0001 patch calculates a "chunk size", which is the smallest number
of buffers it can "assign" to a NUMA node. This depends on how many
buffer descriptors fit onto a single memory page, and it's either 512KB
(with 4KB pages), or 256MB (with 2MB huge pages). And then each NUMA
node gets multiple chunks, to cover shared_buffers/num_nodes. But this
can be an arbitrary number - it minimizes the imbalance, but it also
forces the use of % and / in the formulas.
AFAIK if we required the partitions to be 2^k multiples of the chunk
size, we could switch to using shifts and masking. Which is supposed to
be much faster. But I haven't measured this, and the cost is that some
of the nodes could get much less memory. Maybe that's fine.
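To make the difference concrete, here's a hypothetical sketch (node_shift
is an assumed precomputed value, not something in the current patch) of
what the lookup could look like if each node's buffer count were forced
to a power of two, compared to the division in BufferGetNode():

    /*
     * Hypothetical sketch, not part of the patch: assumes
     * numa_buffers_per_node == (1 << node_shift).
     */
    static inline int
    BufferGetNodeByShift(int buffer, int node_shift)
    {
        /* instead of buffer / numa_buffers_per_node */
        return buffer >> node_shift;
    }

    static inline int
    BufferGetNodeOffset(int buffer, int node_shift)
    {
        /* position of the buffer within its node */
        return buffer & ((1 << node_shift) - 1);
    }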
reserving number of huge pages
------------------------------
The other thing I realized is that partitioning buffers with huge pages
is quite tricky, and can easily lead to SIGBUS when accessing the memory
later. The crashes I saw happen like this:
1) figure # of pages needed (using shared_memory_size_in_huge_pages)
This can be 16828 for shared_buffers=32GB.
2) make sure there's enough huge pages
echo 16828 > /proc/sys/vm/nr_hugepages
3) start postgres - everything seems to work just fine
4) query pg_buffercache_numa - triggers SIGBUS accessing memory for a
valid buffer (usually ~2GB from the end)
It took me ages to realize what's happening, but it's very simple. The
nr_hugepages is a global limit, but it's also translated into limits for
each NUMA node. So when you write 16828 to it, in a 4-node system each
node gets 1/4 of that. See
$ numastat -cm
Then we do the mmap(), and everything looks great, because there really
are enough huge pages and the system can allocate memory from any NUMA
node it needs.
And then we come around, and do the numa_tonode_memory(). And that's
where the issues start, because AFAIK this does not check the per-node
limit of huge pages in any way. It just appears to work. And then later,
when we finally touch the buffer, it tries to actually allocate the
memory on the node, and realizes there's not enough huge pages. And
triggers the SIGBUS.
You may ask why the per-node limit is too low. We still need just
shared_memory_size_in_huge_pages, right? And if we were partitioning the
whole memory segment, that'd be true. But we only do that for shared
buffers, and there's a lot of other shared memory - could be 1-2GB or
so, depending on the configuration.
And this gets placed on one of the nodes, and it counts against the
limit on that particular node. And so it doesn't have enough huge pages
to back the partition of shared buffers.
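To put rough numbers on the 4-node example: writing 16828 to nr_hugepages
with 2MB pages gives each node ~4207 huge pages (~8.2GB), while the shared
buffers partition alone needs 8GB (4096 pages) per node. So the node that
also hosts the other 1-2GB of shared memory can't satisfy its share, and
the first touch of a buffer backed by the missing pages triggers the
SIGBUS.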
The only way around this I found is by inflating the number of huge
pages, significantly above the shared_memory_size_in_huge_pages value.
Just to make sure the nodes get enough huge pages.
I don't know what to do about this. It's quite annoying. If we only used
huge pages for the partitioned parts, this wouldn't be a problem.
I also realize this can be used to make sure the memory is balanced on
NUMA systems. Because if you set nr_hugepages, the kernel will ensure
the shared memory is distributed on all the nodes.
It won't have the benefits of "coordinating" the buffers and buffer
descriptors, and so on. But it will be balanced.
regards
--
Tomas Vondra
Attachments:
v3-0001-NUMA-interleaving-buffers.patch (text/x-patch)
From f0eb1af6fdcfd7daae26952ddc223952333f6af2 Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tomas@vondra.me>
Date: Mon, 28 Jul 2025 14:01:37 +0200
Subject: [PATCH v3 1/7] NUMA: interleaving buffers
Ensure shared buffers are allocated from all NUMA nodes, in a balanced
way, instead of just using the node where Postgres initially starts, or
where the kernel decides to migrate the page, etc. With pre-warming
performed by a single backend, this can easily result in severely
unbalanced memory distribution (with most from a single NUMA node).
The kernel would eventually move some of the memory to other nodes
(thanks to zone_reclaim), but that tends to take a long time. So this
patch improves predictability, reduces the time needed for warmup
during benchmarking, etc. It's less dependent on what the CPU
scheduler does, etc.
Furthermore, the buffers are mapped to NUMA nodes in a deterministic
way, so this also allows further improvements like backends using
buffers from the same NUMA node.
The effect is similar to
numactl --interleave=all
but there's a number of important differences.
Firstly, it's applied only to shared buffers (and also to descriptors),
not to the whole shared memory segment. It's not clear we'd want to use
interleaving for all parts, storing entries with different sizes and
life cycles (e.g. ProcArray may need different approach).
Secondly, it considers the page and block size, and makes sure to always
put the whole buffer on a single NUMA node (even if it happens to use
multiple memory pages), and to keep the buffer and its descriptor on
the same NUMA node. The seriousness/likelihood of these issues depends
on the memory page size (regular vs. huge pages).
The mapping of memory to NUMA nodes happens in larger chunks. This is
required to handle buffer descriptors (which are smaller than buffers),
and so many more fit onto a single memory page.
The number of buffer descriptors per memory page determines the smallest
number of buffers that can be placed on a NUMA node. With 2MB huge pages
this is 256MB, with 4KB pages it's 512KB. Nodes get a multiple of
this, and we try to keep the nodes balanced - the last node can get less
memory, though.
The "buffer partitions" may not be 1:1 with NUMA nodes. There's a
minimal number of partitions (default: 4) that will be created even with
fewer NUMA nodes, or no NUMA at all. Each node gets the same number of
partitions, to keep things simple. For example, with 2 nodes there'll be
4 partitions, with each node getting 2 of them. With 3 nodes there'll be
6 partitions (again, 2 per node).
The patch introduces a simple "registry" of buffer partitions, keeping
track of the first/last buffer, NUMA node, etc. This serves as a source
of truth, both for this patch and for later patches building on this
same buffer partition structure.
With the feature disabled (GUC set to 'off'), there'll be a single
partition for all the buffers (and it won't be mapped to a NUMA node).
Notes:
* The feature is enabled by numa_buffers_interleave GUC (default: false)
* It's not clear we want to enable interleaving for all shared memory.
We probably want that for shared buffers, but maybe not for ProcArray
or freelists.
* Similar questions are about huge pages - in general it's a good idea,
but maybe it's not quite good for ProcArray. It's somewhat separate
from NUMA, but not entirely because NUMA works on page granularity.
PGPROC entries are ~8KB, so too large for interleaving with 4K pages,
as we don't want to split the entry to multiple nodes. But could be
done explicitly, by specifying which node to use for the pages.
* We could partition ProcArray, with one partition per NUMA node, and
then at connection time pick a node from the same node. The process
could migrate to some other node later, especially for long-lived
connections, but there's no perfect solution. Maybe we could set
affinity to cores from the same node, or something like that?
---
contrib/pg_buffercache/Makefile | 2 +-
.../pg_buffercache--1.6--1.7.sql | 22 +
contrib/pg_buffercache/pg_buffercache.control | 2 +-
contrib/pg_buffercache/pg_buffercache_pages.c | 92 +++
src/backend/storage/buffer/buf_init.c | 626 +++++++++++++++++-
src/backend/utils/init/globals.c | 3 +
src/backend/utils/misc/guc_tables.c | 10 +
src/include/miscadmin.h | 2 +
src/include/storage/buf_internals.h | 6 +
src/include/storage/bufmgr.h | 15 +
src/tools/pgindent/typedefs.list | 2 +
11 files changed, 771 insertions(+), 11 deletions(-)
create mode 100644 contrib/pg_buffercache/pg_buffercache--1.6--1.7.sql
diff --git a/contrib/pg_buffercache/Makefile b/contrib/pg_buffercache/Makefile
index 5f748543e2e..0e618f66aec 100644
--- a/contrib/pg_buffercache/Makefile
+++ b/contrib/pg_buffercache/Makefile
@@ -9,7 +9,7 @@ EXTENSION = pg_buffercache
DATA = pg_buffercache--1.2.sql pg_buffercache--1.2--1.3.sql \
pg_buffercache--1.1--1.2.sql pg_buffercache--1.0--1.1.sql \
pg_buffercache--1.3--1.4.sql pg_buffercache--1.4--1.5.sql \
- pg_buffercache--1.5--1.6.sql
+ pg_buffercache--1.5--1.6.sql pg_buffercache--1.6--1.7.sql
PGFILEDESC = "pg_buffercache - monitoring of shared buffer cache in real-time"
REGRESS = pg_buffercache pg_buffercache_numa
diff --git a/contrib/pg_buffercache/pg_buffercache--1.6--1.7.sql b/contrib/pg_buffercache/pg_buffercache--1.6--1.7.sql
new file mode 100644
index 00000000000..bd97246f6ab
--- /dev/null
+++ b/contrib/pg_buffercache/pg_buffercache--1.6--1.7.sql
@@ -0,0 +1,22 @@
+/* contrib/pg_buffercache/pg_buffercache--1.6--1.7.sql */
+
+-- complain if script is sourced in psql, rather than via CREATE EXTENSION
+\echo Use "ALTER EXTENSION pg_buffercache UPDATE TO '1.7'" to load this file. \quit
+
+-- Register the new functions.
+CREATE OR REPLACE FUNCTION pg_buffercache_partitions()
+RETURNS SETOF RECORD
+AS 'MODULE_PATHNAME', 'pg_buffercache_partitions'
+LANGUAGE C PARALLEL SAFE;
+
+-- Create a view for convenient access.
+CREATE VIEW pg_buffercache_partitions AS
+ SELECT P.* FROM pg_buffercache_partitions() AS P
+ (partition integer, numa_node integer, num_buffers integer, first_buffer integer, last_buffer integer);
+
+-- Don't want these to be available to public.
+REVOKE ALL ON FUNCTION pg_buffercache_partitions() FROM PUBLIC;
+REVOKE ALL ON pg_buffercache_partitions FROM PUBLIC;
+
+GRANT EXECUTE ON FUNCTION pg_buffercache_partitions() TO pg_monitor;
+GRANT SELECT ON pg_buffercache_partitions TO pg_monitor;
diff --git a/contrib/pg_buffercache/pg_buffercache.control b/contrib/pg_buffercache/pg_buffercache.control
index b030ba3a6fa..11499550945 100644
--- a/contrib/pg_buffercache/pg_buffercache.control
+++ b/contrib/pg_buffercache/pg_buffercache.control
@@ -1,5 +1,5 @@
# pg_buffercache extension
comment = 'examine the shared buffer cache'
-default_version = '1.6'
+default_version = '1.7'
module_pathname = '$libdir/pg_buffercache'
relocatable = true
diff --git a/contrib/pg_buffercache/pg_buffercache_pages.c b/contrib/pg_buffercache/pg_buffercache_pages.c
index ae0291e6e96..8baa7c7b543 100644
--- a/contrib/pg_buffercache/pg_buffercache_pages.c
+++ b/contrib/pg_buffercache/pg_buffercache_pages.c
@@ -27,6 +27,7 @@
#define NUM_BUFFERCACHE_EVICT_ALL_ELEM 3
#define NUM_BUFFERCACHE_NUMA_ELEM 3
+#define NUM_BUFFERCACHE_PARTITIONS_ELEM 5
PG_MODULE_MAGIC_EXT(
.name = "pg_buffercache",
@@ -100,6 +101,7 @@ PG_FUNCTION_INFO_V1(pg_buffercache_usage_counts);
PG_FUNCTION_INFO_V1(pg_buffercache_evict);
PG_FUNCTION_INFO_V1(pg_buffercache_evict_relation);
PG_FUNCTION_INFO_V1(pg_buffercache_evict_all);
+PG_FUNCTION_INFO_V1(pg_buffercache_partitions);
/* Only need to touch memory once per backend process lifetime */
@@ -771,3 +773,93 @@ pg_buffercache_evict_all(PG_FUNCTION_ARGS)
PG_RETURN_DATUM(result);
}
+
+/*
+ * Inquire about partitioning of buffers between NUMA nodes.
+ */
+Datum
+pg_buffercache_partitions(PG_FUNCTION_ARGS)
+{
+ FuncCallContext *funcctx;
+ MemoryContext oldcontext;
+ TupleDesc tupledesc;
+ TupleDesc expected_tupledesc;
+ HeapTuple tuple;
+ Datum result;
+
+ if (SRF_IS_FIRSTCALL())
+ {
+ funcctx = SRF_FIRSTCALL_INIT();
+
+ /* Switch context when allocating stuff to be used in later calls */
+ oldcontext = MemoryContextSwitchTo(funcctx->multi_call_memory_ctx);
+
+ if (get_call_result_type(fcinfo, NULL, &expected_tupledesc) != TYPEFUNC_COMPOSITE)
+ elog(ERROR, "return type must be a row type");
+
+ if (expected_tupledesc->natts != NUM_BUFFERCACHE_PARTITIONS_ELEM)
+ elog(ERROR, "incorrect number of output arguments");
+
+ /* Construct a tuple descriptor for the result rows. */
+ tupledesc = CreateTemplateTupleDesc(expected_tupledesc->natts);
+ TupleDescInitEntry(tupledesc, (AttrNumber) 1, "partition",
+ INT4OID, -1, 0);
+ TupleDescInitEntry(tupledesc, (AttrNumber) 2, "numa_node",
+ INT4OID, -1, 0);
+ TupleDescInitEntry(tupledesc, (AttrNumber) 3, "num_buffers",
+ INT4OID, -1, 0);
+ TupleDescInitEntry(tupledesc, (AttrNumber) 4, "first_buffer",
+ INT4OID, -1, 0);
+ TupleDescInitEntry(tupledesc, (AttrNumber) 5, "last_buffer",
+ INT4OID, -1, 0);
+
+ funcctx->user_fctx = BlessTupleDesc(tupledesc);
+
+ /* Return to original context when allocating transient memory */
+ MemoryContextSwitchTo(oldcontext);
+
+ /* Set max calls and remember the user function context. */
+ funcctx->max_calls = BufferPartitionCount();
+ }
+
+ funcctx = SRF_PERCALL_SETUP();
+
+ if (funcctx->call_cntr < funcctx->max_calls)
+ {
+ uint32 i = funcctx->call_cntr;
+
+ int numa_node,
+ num_buffers,
+ first_buffer,
+ last_buffer;
+
+ Datum values[NUM_BUFFERCACHE_PARTITIONS_ELEM];
+ bool nulls[NUM_BUFFERCACHE_PARTITIONS_ELEM];
+
+ BufferPartitionGet(i, &numa_node, &num_buffers,
+ &first_buffer, &last_buffer);
+
+ values[0] = Int32GetDatum(i);
+ nulls[0] = false;
+
+ values[1] = Int32GetDatum(numa_node);
+ nulls[1] = false;
+
+ values[2] = Int32GetDatum(num_buffers);
+ nulls[2] = false;
+
+ values[3] = Int32GetDatum(first_buffer);
+ nulls[3] = false;
+
+ values[4] = Int32GetDatum(last_buffer);
+ nulls[4] = false;
+
+ /* Build and return the tuple. */
+ tuple = heap_form_tuple((TupleDesc) funcctx->user_fctx, values, nulls);
+ result = HeapTupleGetDatum(tuple);
+
+ SRF_RETURN_NEXT(funcctx, result);
+ }
+ else
+ SRF_RETURN_DONE(funcctx);
+}
diff --git a/src/backend/storage/buffer/buf_init.c b/src/backend/storage/buffer/buf_init.c
index ed1dc488a42..5b65a855b29 100644
--- a/src/backend/storage/buffer/buf_init.c
+++ b/src/backend/storage/buffer/buf_init.c
@@ -14,9 +14,17 @@
*/
#include "postgres.h"
+#ifdef USE_LIBNUMA
+#include <numa.h>
+#include <numaif.h>
+#endif
+
+#include "port/pg_numa.h"
#include "storage/aio.h"
#include "storage/buf_internals.h"
#include "storage/bufmgr.h"
+#include "storage/pg_shmem.h"
+#include "storage/proc.h"
BufferDescPadded *BufferDescriptors;
char *BufferBlocks;
@@ -24,6 +32,19 @@ ConditionVariableMinimallyPadded *BufferIOCVArray;
WritebackContext BackendWritebackContext;
CkptSortItem *CkptBufferIds;
+BufferPartitions *BufferPartitionsArray;
+
+static Size get_memory_page_size(void);
+static void buffer_partitions_prepare(void);
+static void buffer_partitions_init(void);
+
+/* number of NUMA nodes (as returned by numa_num_configured_nodes) */
+static int numa_nodes = -1; /* number of nodes when sizing */
+static Size numa_page_size = 0; /* page used to size partitions */
+static bool numa_can_partition = false; /* can map to NUMA nodes? */
+static int numa_buffers_per_node = -1; /* buffers per node */
+static int numa_partitions = 0; /* total (multiple of nodes) */
+
/*
* Data Structures:
@@ -70,19 +91,89 @@ BufferManagerShmemInit(void)
bool foundBufs,
foundDescs,
foundIOCV,
- foundBufCkpt;
+ foundBufCkpt,
+ foundParts;
+ Size mem_page_size;
+ Size buffer_align;
+
+ /*
+ * XXX A bit weird. Do we need to worry about postmaster? Could this even
+ * run outside postmaster? I don't think so.
+ *
+ * XXX Another issue is we may get different values than when sizing the
+ * memory, because at that point we didn't know if we get huge pages,
+ * so we assumed we will. Shouldn't cause crashes, but we might allocate
+ * shared memory and then not use some of it (because of the alignment
+ * that we don't actually need). Not sure about better way, good for now.
+ */
+ if (IsUnderPostmaster)
+ mem_page_size = pg_get_shmem_pagesize();
+ else
+ mem_page_size = get_memory_page_size();
+
+ /*
+ * With NUMA we need to ensure the buffers are properly aligned not just
+ * to PG_IO_ALIGN_SIZE, but also to memory page size, because NUMA works
+ * on page granularity, and we don't want a buffer to get split to
+ * multiple nodes (when using multiple memory pages).
+ *
+ * We also don't want to interfere with other parts of shared memory,
+ * which could easily happen with huge pages (e.g. with data stored before
+ * buffers).
+ *
+ * We do this by aligning to the larger of the two values (we know both
+ * are power-of-two values, so the larger value is automatically a
+ * multiple of the lesser one).
+ *
+ * XXX Maybe there's a way to use less alignment?
+ *
+ * XXX Maybe with (mem_page_size > PG_IO_ALIGN_SIZE), we don't need to
+ * align to mem_page_size? Especially for very large huge pages (e.g. 1GB)
+ * that doesn't seem quite worth it. Maybe we should simply align to
+ * BLCKSZ, so that buffers don't get split? Still, we might interfere with
+ * other stuff stored in shared memory that we want to allocate on a
+ * particular NUMA node (e.g. ProcArray).
+ *
+ * XXX Maybe with "too large" huge pages we should just not do this, or
+ * maybe do this only for sufficiently large areas (e.g. shared buffers,
+ * but not ProcArray).
+ */
+ buffer_align = Max(mem_page_size, PG_IO_ALIGN_SIZE);
+
+ /* one page is a multiple of the other */
+ Assert(((mem_page_size % PG_IO_ALIGN_SIZE) == 0) ||
+ ((PG_IO_ALIGN_SIZE % mem_page_size) == 0));
+
+ /* allocate the partition registry first */
+ BufferPartitionsArray = (BufferPartitions *)
+ ShmemInitStruct("Buffer Partitions",
+ offsetof(BufferPartitions, partitions) +
+ mul_size(sizeof(BufferPartition), numa_partitions),
+ &foundParts);
- /* Align descriptors to a cacheline boundary. */
+ /*
+ * Align descriptors to a cacheline boundary, and memory page.
+ *
+ * We want to distribute both to NUMA nodes, so that each buffer and its
+ * descriptor are on the same NUMA node. So we align both the same way.
+ *
+ * XXX The memory page is always larger than cacheline, so the cacheline
+ * reference is a bit unnecessary.
+ *
+ * XXX In principle we only need to do this with NUMA, otherwise we could
+ * still align just to cacheline, as before.
+ */
BufferDescriptors = (BufferDescPadded *)
- ShmemInitStruct("Buffer Descriptors",
- NBuffers * sizeof(BufferDescPadded),
- &foundDescs);
+ TYPEALIGN(buffer_align,
+ ShmemInitStruct("Buffer Descriptors",
+ NBuffers * sizeof(BufferDescPadded) + buffer_align,
+ &foundDescs));
/* Align buffer pool on IO page size boundary. */
BufferBlocks = (char *)
- TYPEALIGN(PG_IO_ALIGN_SIZE,
+ TYPEALIGN(buffer_align,
ShmemInitStruct("Buffer Blocks",
- NBuffers * (Size) BLCKSZ + PG_IO_ALIGN_SIZE,
+ NBuffers * (Size) BLCKSZ + buffer_align,
&foundBufs));
/* Align condition variables to cacheline boundary. */
@@ -112,6 +203,12 @@ BufferManagerShmemInit(void)
{
int i;
+ /*
+ * Initialize the registry of buffer partitions, and also move the
+ * memory to different NUMA nodes (if enabled by GUC)
+ */
+ buffer_partitions_init();
+
/*
* Initialize all the buffer headers.
*/
@@ -144,6 +241,11 @@ BufferManagerShmemInit(void)
GetBufferDescriptor(NBuffers - 1)->freeNext = FREENEXT_END_OF_LIST;
}
+ /*
+ * At this point we have all the buffers in a single long freelist. With
+ * freelist partitioning we rebuild them in StrategyInitialize.
+ */
+
/* Init other shared buffer-management stuff */
StrategyInitialize(!foundDescs);
@@ -152,24 +254,68 @@ BufferManagerShmemInit(void)
&backend_flush_after);
}
+/*
+ * Determine the size of memory page.
+ *
+ * XXX This is a bit tricky, because the result depends at which point we call
+ * this. Before the allocation we don't know if we succeed in allocating huge
+ * pages - but we have to size everything for the chance that we will. And then
+ * if the huge pages fail (with 'huge_pages=try'), we'll use the regular memory
+ * pages. But at that point we can't adjust the sizing.
+ *
+ * XXX Maybe with huge_pages=try we should do the sizing twice - first with
+ * huge pages, and if that fails, then without them. But not for this patch.
+ * Up to this point there was no such dependency on huge pages.
+ */
+static Size
+get_memory_page_size(void)
+{
+ Size os_page_size;
+ Size huge_page_size;
+
+#ifdef WIN32
+ SYSTEM_INFO sysinfo;
+
+ GetSystemInfo(&sysinfo);
+ os_page_size = sysinfo.dwPageSize;
+#else
+ os_page_size = sysconf(_SC_PAGESIZE);
+#endif
+
+ /* assume huge pages get used, unless HUGE_PAGES_OFF */
+ if (huge_pages_status != HUGE_PAGES_OFF)
+ GetHugePageSize(&huge_page_size, NULL);
+ else
+ huge_page_size = 0;
+
+ return Max(os_page_size, huge_page_size);
+}
+
/*
* BufferManagerShmemSize
*
* compute the size of shared memory for the buffer pool including
* data pages, buffer descriptors, hash tables, etc.
+ *
+ * XXX Called before allocation, so we don't know if huge pages get used yet.
+ * So we need to assume huge pages get used, and use get_memory_page_size()
+ * to calculate the largest possible memory page.
*/
Size
BufferManagerShmemSize(void)
{
Size size = 0;
+ /* calculate partition info for buffers */
+ buffer_partitions_prepare();
+
/* size of buffer descriptors */
size = add_size(size, mul_size(NBuffers, sizeof(BufferDescPadded)));
/* to allow aligning buffer descriptors */
- size = add_size(size, PG_CACHE_LINE_SIZE);
+ size = add_size(size, Max(numa_page_size, PG_IO_ALIGN_SIZE));
/* size of data pages, plus alignment padding */
- size = add_size(size, PG_IO_ALIGN_SIZE);
+ size = add_size(size, Max(numa_page_size, PG_IO_ALIGN_SIZE));
size = add_size(size, mul_size(NBuffers, BLCKSZ));
/* size of stuff controlled by freelist.c */
@@ -184,5 +330,467 @@ BufferManagerShmemSize(void)
/* size of checkpoint sort array in bufmgr.c */
size = add_size(size, mul_size(NBuffers, sizeof(CkptSortItem)));
+ /* account for registry of NUMA partitions */
+ size = add_size(size, MAXALIGN(offsetof(BufferPartitions, partitions) +
+ mul_size(sizeof(BufferPartition), numa_partitions)));
+
return size;
}
+
+/*
+ * Calculate the NUMA node for a given buffer.
+ */
+int
+BufferGetNode(Buffer buffer)
+{
+ /* not NUMA interleaving */
+ if (numa_buffers_per_node == -1)
+ return 0;
+
+ return (buffer / numa_buffers_per_node);
+}
+
+/*
+ * pg_numa_move_to_node
+ * move a contiguous chunk of memory to a single NUMA node
+ *
+ * startptr - start of the region (should be aligned to page size)
+ * endptr - end of the region (doesn't need to be aligned)
+ * node - NUMA node to move the region to
+ *
+ * XXX Maybe this should also use numa_police_memory? numa_tonode_memory
+ * works on larger chunks of memory, not individual system pages, so it
+ * should be more efficient than numa_move_pages.
+ */
+static void
+pg_numa_move_to_node(char *startptr, char *endptr, int node)
+{
+ Size mem_page_size;
+ Size sz;
+
+ /*
+ * Get the "actual" memory page size, not the one we used for sizing. We
+ * might have used huge page for sizing, but only get regular pages when
+ * allocating, so we must use the smaller pages here.
+ *
+ * XXX A bit weird. Do we need to worry about postmaster? Could this even
+ * run outside postmaster? I don't think so.
+ */
+ if (IsUnderPostmaster)
+ mem_page_size = pg_get_shmem_pagesize();
+ else
+ mem_page_size = get_memory_page_size();
+
+ Assert((int64) startptr % mem_page_size == 0);
+
+ sz = (endptr - startptr);
+ numa_tonode_memory(startptr, sz, node);
+}
+
+
+#define MIN_BUFFER_PARTITIONS 4
+
+/*
+ * buffer_partitions_prepare
+ * Calculate parameters for partitioning buffers.
+ *
+ * We want to split the shared buffers into multiple partitions, of roughly
+ * the same size. This is meant to serve multiple purposes. We want to map
+ * the partitions to different NUMA nodes, to balance memory usage, and
+ * allow partitioning some data structures built on top of buffers, to give
+ * preference to local access (buffers on the same NUMA node). This applies
+ * mostly to freelists and clocksweep.
+ *
+ * We may want to use partitioning even on non-NUMA systems, or when running
+ * on a single NUMA node. Partitioning the freelist/clocksweep is beneficial
+ * even without the NUMA effects.
+ *
+ * So we try to always build at least 4 partitions (MIN_BUFFER_PARTITIONS)
+ * in total, or at least one partition per NUMA node. We always create the
+ * same number of partitions per NUMA node.
+ *
+ * Some examples:
+ *
+ * - non-NUMA system (or 1 NUMA node): 4 partitions for the single node
+ *
+ * - 2 NUMA nodes: 4 partitions, 2 for each node
+ *
+ * - 3 NUMA nodes: 6 partitions, 2 for each node
+ *
+ * - 4+ NUMA nodes: one partition per node
+ *
+ * NUMA works on the memory-page granularity, which determines the smallest
+ * amount of memory we can allocate to single node. This is determined by
+ * how many BufferDescriptors fit onto a single memory page, so this depends
+ * on huge page support. With 2MB huge pages (typical on x86 Linux), this is
+ * 32768 buffers (256MB). With regular 4kB pages, it's 64 buffers (512KB).
+ *
+ * Note: This is determined before the allocation, i.e. we don't know if the
+ * allocation got to use huge pages. So unless huge_pages=off we assume we're
+ * using huge pages.
+ *
+ * This minimal size requirement only matters for the per-node amount of
+ * memory, not for the individual partitions. The partitions for the same
+ * node are a contiguous chunk of memory, which can be split arbitrarily,
+ * it's independent of the NUMA granularity.
+ *
+ * XXX This patch only implements placing the buffers onto different NUMA
+ * nodes. The freelist/clocksweep partitioning is implemented in separate
+ * patches later in the patch series. Those patches however use the same
+ * buffer partition registry, to align the partitions.
+ *
+ *
+ * XXX This needs to consider the minimum chunk size, i.e. we can't split
+ * buffers beyond some point - at some point we run into the size of
+ * buffer descriptors. Not sure if we should give preference to one of these
+ * (probably at least print a warning).
+ *
+ * XXX We want to do this even with numa_buffers_interleave=false, so that the
+ * other patches can do their partitioning. But in that case we don't need to
+ * enforce the min chunk size (probably)?
+ *
+ * XXX We need to only call this once, when sizing the memory. But at that
+ * point we don't know if we get to use huge pages or not (unless when huge
+ * pages are disabled). We'll proceed as if the huge pages were used, and we
+ * may have to use larger partitions. Maybe there's some sort of fallback,
+ * but for now we simply disable the NUMA partitioning - it simply means the
+ * shared buffers are too small.
+ *
+ * XXX We don't need to make each partition a multiple of min_partition_size.
+ * That's something we need to do for a node (because NUMA works at granularity
+ * of pages), but partitions for a single node can split that arbitrarily.
+ * Although keeping the sizes power-of-two would allow calculating everything
+ * as shift/mask, without expensive division/modulo operations.
+ */
+static void
+buffer_partitions_prepare(void)
+{
+ /*
+ * Minimum number of buffers we can allocate to a NUMA node (determined by
+ * how many BufferDescriptors fit onto a memory page).
+ */
+ int min_node_buffers;
+
+ /*
+ * Maximum number of nodes we can split shared buffers to, assuming each
+ * node gets the smallest allocatable chunk (the last node can get a
+ * smaller amount of memory, not the full chunk).
+ */
+ int max_nodes;
+
+ /*
+ * How many partitions to create per node. Could be more than 1 for small
+ * number of nodes (or non-NUMA systems).
+ */
+ int num_partitions_per_node;
+
+ /* bail out if already initialized (calculate only once) */
+ if (numa_nodes != -1)
+ return;
+
+ /* XXX only gives us the number, the nodes may not be 0, 1, 2, ... */
+ numa_nodes = numa_num_configured_nodes();
+
+ /* XXX can this happen? */
+ if (numa_nodes < 1)
+ numa_nodes = 1;
+
+ elog(WARNING, "IsUnderPostmaster %d", IsUnderPostmaster);
+
+ /*
+ * XXX A bit weird. Do we need to worry about postmaster? Could this even
+ * run outside postmaster? I don't think so.
+ *
+ * XXX Another issue is we may get different values than when sizing the
+ * memory, because at that point we didn't know if we get huge pages,
+ * so we assumed we will. Shouldn't cause crashes, but we might allocate
+ * shared memory and then not use some of it (because of the alignment
+ * that we don't actually need). Not sure about better way, good for now.
+ */
+ if (IsUnderPostmaster)
+ numa_page_size = pg_get_shmem_pagesize();
+ else
+ numa_page_size = get_memory_page_size();
+
+ /* make sure the chunks will align nicely */
+ Assert(BLCKSZ % sizeof(BufferDescPadded) == 0);
+ Assert(numa_page_size % sizeof(BufferDescPadded) == 0);
+ Assert(((BLCKSZ % numa_page_size) == 0) || ((numa_page_size % BLCKSZ) == 0));
+
+ /*
+ * The minimum number of buffers we can allocate from a single node, using
+ * the memory page size (determined by buffer descriptors). NUMA allocates
+ * memory in pages, and we need to do that for both buffers and
+ * descriptors at the same time.
+ *
+ * In practice the BLCKSZ doesn't really matter, because it's much larger
+ * than BufferDescPadded, so the result is determined by the buffer descriptors.
+ */
+ min_node_buffers = (numa_page_size / sizeof(BufferDescPadded));
+
+ /*
+ * Maximum number of nodes (each getting min_node_buffers) we can handle
+ * given the current shared buffers size. The last node is allowed to be
+ * smaller (half of the other nodes).
+ */
+ max_nodes = (NBuffers + (min_node_buffers / 2)) / min_node_buffers;
+
+ /*
+ * Can we actually do NUMA partitioning with these settings? If we can't
+ * handle the current number of nodes, then no.
+ *
+ * XXX This shouldn't be a big issue in practice. NUMA systems typically
+ * run with large shared buffers, which also makes the imbalance issues
+ * fairly significant (it's quick to rebalance 128MB, much slower to do
+ * that for 256GB).
+ */
+ numa_can_partition = true; /* assume we can allocate to nodes */
+ if (numa_nodes > max_nodes)
+ {
+ elog(WARNING, "shared buffers too small for %d nodes (max nodes %d)",
+ numa_nodes, max_nodes);
+ numa_can_partition = false;
+ }
+
+ /*
+ * We know we can partition to the desired number of nodes, now it's time
+ * to figure out how many partitions we need per node. We simply add
+ * partitions per node until we reach MIN_BUFFER_PARTITIONS.
+ *
+ * XXX Maybe we should make sure to keep the actual partition size a power
+ * of 2, to make the calculations simpler (shift instead of mod).
+ */
+ num_partitions_per_node = 1;
+
+ while (numa_nodes * num_partitions_per_node < MIN_BUFFER_PARTITIONS)
+ num_partitions_per_node++;
+
+ /* now we know the total number of partitions */
+ numa_partitions = (numa_nodes * num_partitions_per_node);
+
+ /*
+ * Finally, calculate how many buffers we'll assign to a single NUMA node.
+ * If we have only a single node, or can't map to that many nodes, just
+ * take a "fair share" of buffers.
+ *
+ * XXX In both cases the last node can get fewer buffers.
+ */
+ if (!numa_can_partition)
+ {
+ numa_buffers_per_node = (NBuffers + (numa_nodes - 1)) / numa_nodes;
+ }
+ else
+ {
+ numa_buffers_per_node = min_node_buffers;
+ while (numa_buffers_per_node * numa_nodes < NBuffers)
+ numa_buffers_per_node += min_node_buffers;
+
+ /* the last node should get at least some buffers */
+ Assert(NBuffers - (numa_nodes - 1) * numa_buffers_per_node > 0);
+ }
+
+ elog(LOG, "NUMA: buffers %d partitions %d num_nodes %d per_node %d buffers_per_node %d (min %d)",
+ NBuffers, numa_partitions, numa_nodes, num_partitions_per_node,
+ numa_buffers_per_node, min_node_buffers);
+}
+
+static void
+AssertCheckBufferPartitions(void)
+{
+#ifdef USE_ASSERT_CHECKING
+ int num_buffers = 0;
+
+ for (int i = 0; i < numa_partitions; i++)
+ {
+ BufferPartition *part = &BufferPartitionsArray->partitions[i];
+
+ /*
+ * We can get a single-buffer partition, if the sizing forces the last
+ * partition to be just one buffer. But it's unlikely (and
+ * undesirable).
+ */
+ Assert(part->first_buffer <= part->last_buffer);
+ Assert((part->last_buffer - part->first_buffer + 1) == part->num_buffers);
+
+ num_buffers += part->num_buffers;
+
+ /*
+ * The first partition needs to start on buffer 0. Later partitions
+ * need to be contiguous, without skipping any buffers.
+ */
+ if (i == 0)
+ {
+ Assert(part->first_buffer == 0);
+ }
+ else
+ {
+ BufferPartition *prev = &BufferPartitionsArray->partitions[i - 1];
+
+ Assert((part->first_buffer - 1) == prev->last_buffer);
+ }
+
+ /* the last partition needs to end on buffer (NBuffers - 1) */
+ if (i == (numa_partitions - 1))
+ {
+ Assert(part->last_buffer == (NBuffers - 1));
+ }
+ }
+
+ Assert(num_buffers == NBuffers);
+#endif
+}
+
+static void
+buffer_partitions_init(void)
+{
+ int remaining_buffers = NBuffers;
+ int buffer = 0;
+ int parts_per_node = (numa_partitions / numa_nodes);
+ char *buffers_ptr,
+ *descriptors_ptr;
+
+ BufferPartitionsArray->npartitions = numa_partitions;
+
+ for (int n = 0; n < numa_nodes; n++)
+ {
+ /* buffers this node should get (last node can get fewer) */
+ int node_buffers = Min(remaining_buffers, numa_buffers_per_node);
+
+ /* split node buffers between partitions (last one can get fewer) */
+ int part_buffers = (node_buffers + (parts_per_node - 1)) / parts_per_node;
+
+ remaining_buffers -= node_buffers;
+
+ Assert((node_buffers > 0) && (node_buffers <= NBuffers));
+ Assert((n >= 0) && (n < numa_nodes));
+
+ for (int p = 0; p < parts_per_node; p++)
+ {
+ int idx = (n * parts_per_node) + p;
+ BufferPartition *part = &BufferPartitionsArray->partitions[idx];
+ int num_buffers = Min(node_buffers, part_buffers);
+
+ Assert((idx >= 0) && (idx < numa_partitions));
+ Assert((buffer >= 0) && (buffer < NBuffers));
+ Assert((num_buffers > 0) && (num_buffers <= part_buffers));
+
+ /* XXX we should get the actual node ID from the mask */
+ part->numa_node = n;
+
+ part->num_buffers = num_buffers;
+ part->first_buffer = buffer;
+ part->last_buffer = buffer + (num_buffers - 1);
+
+ elog(LOG, "NUMA: buffer %d node %d partition %d buffers %d first %d last %d", idx, n, p, num_buffers, buffer, buffer + (num_buffers - 1));
+
+ buffer += num_buffers;
+ node_buffers -= part_buffers;
+ }
+ }
+
+ AssertCheckBufferPartitions();
+
+ /*
+ * With buffers interleaving disabled (or can't partition, because of
+ * shared buffers being too small), we're done.
+ */
+ if (!numa_buffers_interleave || !numa_can_partition)
+ return;
+
+ /*
+ * Assign chunks of buffers and buffer descriptors to the available NUMA
+ * nodes. We can't use the regular interleaving, because with regular
+ * memory pages (smaller than BLCKSZ) we'd split all buffers to multiple
+ * NUMA nodes. And we don't want that.
+ *
+ * But even with huge pages it seems like a good idea to not have mapping
+ * for each page.
+ *
+ * So we always assign a larger contiguous chunk of buffers to the same
+ * NUMA node, as calculated by choose_chunk_buffers(). We try to keep the
+ * chunks large enough to work both for buffers and buffer descriptors,
+ * but not too large. See the comments at choose_chunk_buffers() for
+ * details.
+ *
+ * Thanks to the earlier alignment (to memory page etc.), we know the
+ * buffers won't get split, etc.
+ *
+ * This also makes it easier / straightforward to calculate which NUMA
+ * node a buffer belongs to (it's a matter of divide + mod). See
+ * BufferGetNode().
+ *
+ * We need to account for partitions being of different length, when the
+ * NBuffers is not nicely divisible. To do that we keep track of the start
+ * of the next partition.
+ */
+ buffers_ptr = BufferBlocks;
+ descriptors_ptr = (char *) BufferDescriptors;
+
+ for (int i = 0; i < numa_partitions; i++)
+ {
+ BufferPartition *part = &BufferPartitionsArray->partitions[i];
+ char *startptr,
+ *endptr;
+
+ /* first map buffers */
+ startptr = buffers_ptr;
+ endptr = startptr + ((Size) part->num_buffers * BLCKSZ);
+ buffers_ptr = endptr; /* start of the next partition */
+
+ elog(LOG, "NUMA: buffer_partitions_init: %d => %d buffers %d start %p end %p (size %ld)",
+ i, part->numa_node, part->num_buffers, startptr, endptr, (endptr - startptr));
+
+ pg_numa_move_to_node(startptr, endptr, part->numa_node);
+
+ /* now do the same for buffer descriptors */
+ startptr = descriptors_ptr;
+ endptr = startptr + ((Size) part->num_buffers * sizeof(BufferDescPadded));
+ descriptors_ptr = endptr;
+
+ elog(LOG, "NUMA: buffer_partitions_init: %d => %d descriptors %d start %p end %p (size %ld)",
+ i, part->numa_node, part->num_buffers, startptr, endptr, (endptr - startptr));
+
+ pg_numa_move_to_node(startptr, endptr, part->numa_node);
+ }
+
+ /* we should have consumed the arrays exactly */
+ Assert(buffers_ptr == BufferBlocks + (Size) NBuffers * BLCKSZ);
+ Assert(descriptors_ptr == (char *) BufferDescriptors + (Size) NBuffers * sizeof(BufferDescPadded));
+}
+
+int
+BufferPartitionCount(void)
+{
+ return BufferPartitionsArray->npartitions;
+}
+
+void
+BufferPartitionGet(int idx, int *node, int *num_buffers,
+ int *first_buffer, int *last_buffer)
+{
+ if ((idx >= 0) && (idx < BufferPartitionsArray->npartitions))
+ {
+ BufferPartition *part = &BufferPartitionsArray->partitions[idx];
+
+ *node = part->numa_node;
+ *num_buffers = part->num_buffers;
+ *first_buffer = part->first_buffer;
+ *last_buffer = part->last_buffer;
+
+ return;
+ }
+
+ elog(ERROR, "invalid partition index");
+}
+
+void
+BufferPartitionParams(int *num_partitions, int *num_nodes)
+{
+ *num_partitions = numa_partitions;
+ *num_nodes = numa_nodes;
+}
diff --git a/src/backend/utils/init/globals.c b/src/backend/utils/init/globals.c
index d31cb45a058..876cb64cf66 100644
--- a/src/backend/utils/init/globals.c
+++ b/src/backend/utils/init/globals.c
@@ -145,6 +145,9 @@ int max_worker_processes = 8;
int max_parallel_workers = 8;
int MaxBackends = 0;
+/* NUMA stuff */
+bool numa_buffers_interleave = false;
+
/* GUC parameters for vacuum */
int VacuumBufferUsageLimit = 2048;
diff --git a/src/backend/utils/misc/guc_tables.c b/src/backend/utils/misc/guc_tables.c
index d14b1678e7f..9570087aa60 100644
--- a/src/backend/utils/misc/guc_tables.c
+++ b/src/backend/utils/misc/guc_tables.c
@@ -2116,6 +2116,16 @@ struct config_bool ConfigureNamesBool[] =
NULL, NULL, NULL
},
+ {
+ {"numa_buffers_interleave", PGC_POSTMASTER, DEVELOPER_OPTIONS,
+ gettext_noop("Enables NUMA interleaving of shared buffers."),
+ gettext_noop("When enabled, the buffers in shared memory are interleaved to all NUMA nodes."),
+ },
+ &numa_buffers_interleave,
+ false,
+ NULL, NULL, NULL
+ },
+
{
{"sync_replication_slots", PGC_SIGHUP, REPLICATION_STANDBY,
gettext_noop("Enables a physical standby to synchronize logical failover replication slots from the primary server."),
diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h
index 1bef98471c3..014a6079af2 100644
--- a/src/include/miscadmin.h
+++ b/src/include/miscadmin.h
@@ -178,6 +178,8 @@ extern PGDLLIMPORT int MaxConnections;
extern PGDLLIMPORT int max_worker_processes;
extern PGDLLIMPORT int max_parallel_workers;
+extern PGDLLIMPORT bool numa_buffers_interleave;
+
extern PGDLLIMPORT int commit_timestamp_buffers;
extern PGDLLIMPORT int multixact_member_buffers;
extern PGDLLIMPORT int multixact_offset_buffers;
diff --git a/src/include/storage/buf_internals.h b/src/include/storage/buf_internals.h
index 52a71b138f7..9dfbecb9fe4 100644
--- a/src/include/storage/buf_internals.h
+++ b/src/include/storage/buf_internals.h
@@ -323,6 +323,7 @@ typedef struct WritebackContext
/* in buf_init.c */
extern PGDLLIMPORT BufferDescPadded *BufferDescriptors;
+extern PGDLLIMPORT BufferPartitions *BufferPartitionsArray;
extern PGDLLIMPORT ConditionVariableMinimallyPadded *BufferIOCVArray;
extern PGDLLIMPORT WritebackContext BackendWritebackContext;
@@ -491,4 +492,9 @@ extern void DropRelationLocalBuffers(RelFileLocator rlocator,
extern void DropRelationAllLocalBuffers(RelFileLocator rlocator);
extern void AtEOXact_LocalBuffers(bool isCommit);
+extern int BufferPartitionCount(void);
+extern void BufferPartitionGet(int idx, int *node, int *num_buffers,
+ int *first_buffer, int *last_buffer);
+extern void BufferPartitionParams(int *num_partitions, int *num_nodes);
+
#endif /* BUFMGR_INTERNALS_H */
diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index 41fdc1e7693..deaf4f19fa4 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -143,6 +143,20 @@ struct ReadBuffersOperation
typedef struct ReadBuffersOperation ReadBuffersOperation;
+typedef struct BufferPartition
+{
+ int numa_node;
+ int num_buffers;
+ int first_buffer;
+ int last_buffer;
+} BufferPartition;
+
+typedef struct BufferPartitions
+{
+ int npartitions;
+ BufferPartition partitions[FLEXIBLE_ARRAY_MEMBER];
+} BufferPartitions;
+
/* forward declared, to avoid having to expose buf_internals.h here */
struct WritebackContext;
@@ -319,6 +333,7 @@ extern void EvictRelUnpinnedBuffers(Relation rel,
/* in buf_init.c */
extern void BufferManagerShmemInit(void);
extern Size BufferManagerShmemSize(void);
+extern int BufferGetNode(Buffer buffer);
/* in localbuf.c */
extern void AtProcExit_LocalBuffers(void);
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 3daba26b237..c695cfa76e8 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -346,6 +346,8 @@ BufferDescPadded
BufferHeapTupleTableSlot
BufferLookupEnt
BufferManagerRelation
+BufferPartition
+BufferPartitions
BufferStrategyControl
BufferTag
BufferUsage
--
2.50.1
v3-0007-NUMA-pin-backends-to-NUMA-nodes.patch (text/x-patch)
From 3b3a929b007205c46b3c193cf38e3edf71084af7 Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tomas@vondra.me>
Date: Tue, 27 May 2025 23:08:48 +0200
Subject: [PATCH v3 7/7] NUMA: pin backends to NUMA nodes
When initializing the backend, we pick a PGPROC entry from the right
NUMA node where the backend is running. But the process can move to a
different core / node, so to prevent that we pin it.
---
src/backend/storage/lmgr/proc.c | 21 +++++++++++++++++++++
src/backend/utils/init/globals.c | 1 +
src/backend/utils/misc/guc_tables.c | 10 ++++++++++
src/include/miscadmin.h | 1 +
4 files changed, 33 insertions(+)
diff --git a/src/backend/storage/lmgr/proc.c b/src/backend/storage/lmgr/proc.c
index 11259151a7d..dbb4cbb1bfa 100644
--- a/src/backend/storage/lmgr/proc.c
+++ b/src/backend/storage/lmgr/proc.c
@@ -766,6 +766,27 @@ InitProcess(void)
}
MyProcNumber = GetNumberFromPGProc(MyProc);
+ /*
+ * Optionally, restrict the process to only run on CPUs from the same NUMA
+ * as the PGPROC. We do this even if the PGPROC has a different NUMA node,
+ * but not for PGPROC entries without a node (i.e. aux/2PC entries).
+ *
+ * This also means we only do this with numa_procs_interleave, because
+ * without that we'll have numa_node=-1 for all PGPROC entries.
+ *
+ * FIXME add proper error-checking for libnuma functions
+ */
+ if (numa_procs_pin && MyProc->numa_node != -1)
+ {
+ struct bitmask *cpumask = numa_allocate_cpumask();
+
+ numa_node_to_cpus(MyProc->numa_node, cpumask);
+
+ numa_sched_setaffinity(MyProcPid, cpumask);
+
+ numa_free_cpumask(cpumask);
+ }
+
/*
* Cross-check that the PGPROC is of the type we expect; if this were not
* the case, it would get returned to the wrong list.
diff --git a/src/backend/utils/init/globals.c b/src/backend/utils/init/globals.c
index 6ee4684d1b8..3f88659b49f 100644
--- a/src/backend/utils/init/globals.c
+++ b/src/backend/utils/init/globals.c
@@ -150,6 +150,7 @@ bool numa_buffers_interleave = false;
bool numa_localalloc = false;
bool numa_partition_freelist = false;
bool numa_procs_interleave = false;
+bool numa_procs_pin = false;
/* GUC parameters for vacuum */
int VacuumBufferUsageLimit = 2048;
diff --git a/src/backend/utils/misc/guc_tables.c b/src/backend/utils/misc/guc_tables.c
index 7b718760248..862341e137e 100644
--- a/src/backend/utils/misc/guc_tables.c
+++ b/src/backend/utils/misc/guc_tables.c
@@ -2156,6 +2156,16 @@ struct config_bool ConfigureNamesBool[] =
NULL, NULL, NULL
},
+ {
+ {"numa_procs_pin", PGC_POSTMASTER, DEVELOPER_OPTIONS,
+ gettext_noop("Enables pinning backends to NUMA nodes (matching the PGPROC node)."),
+ gettext_noop("When enabled, sets affinity to CPUs from the same NUMA node."),
+ },
+ &numa_procs_pin,
+ false,
+ NULL, NULL, NULL
+ },
+
{
{"sync_replication_slots", PGC_SIGHUP, REPLICATION_STANDBY,
gettext_noop("Enables a physical standby to synchronize logical failover replication slots from the primary server."),
diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h
index cdeee8dccba..a97741c6707 100644
--- a/src/include/miscadmin.h
+++ b/src/include/miscadmin.h
@@ -182,6 +182,7 @@ extern PGDLLIMPORT bool numa_buffers_interleave;
extern PGDLLIMPORT bool numa_localalloc;
extern PGDLLIMPORT bool numa_partition_freelist;
extern PGDLLIMPORT bool numa_procs_interleave;
+extern PGDLLIMPORT bool numa_procs_pin;
extern PGDLLIMPORT int commit_timestamp_buffers;
extern PGDLLIMPORT int multixact_member_buffers;
--
2.50.1
v3-0006-NUMA-interleave-PGPROC-entries.patch (text/x-patch)
From 5addb5973ce571debebf07b17adc07eb828a48ee Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tomas@vondra.me>
Date: Thu, 22 May 2025 18:39:08 +0200
Subject: [PATCH v3 6/7] NUMA: interleave PGPROC entries
The goal is to distribute ProcArray (or rather PGPROC entries and
associated fast-path arrays) to NUMA nodes.
We can't do this by simply interleaving pages, because that wouldn't
work for both parts at the same time. We want to place the PGPROC and
its fast-path locking structs on the same node, but the structs are
of different sizes, etc.
Another problem is that PGPROC entries are fairly small, so with huge
pages and reasonable values of max_connections everything fits onto a
single page. We don't want to make this incompatible with huge pages.
Note: If we eventually switch to allocating separate shared segments for
different parts (to allow on-line resizing), we could keep using regular
pages for procarray, and this would not be such an issue.
To make this work, we split the PGPROC array into per-node segments,
each with about (MaxBackends / numa_nodes) entries, and one segment for
auxiliary processes and prepared transactions. And we do the same thing
for fast-path arrays.
The PGPROC segments are laid out like this (e.g. for 2 NUMA nodes):
- PGPROC array / node #0
- PGPROC array / node #1
- PGPROC array / aux processes + 2PC transactions
- fast-path arrays / node #0
- fast-path arrays / node #1
- fast-path arrays / aux processes + 2PC transaction
Each segment is aligned to (starts at) a memory page boundary, and
effectively spans a whole number of memory pages.
Having a single PGPROC array made certain operations easier - e.g. it
was possible to iterate the array, and GetNumberFromPGProc() could
calculate offset by simply subtracting PGPROC pointers. With multiple
segments that's not possible, but the fallout is minimal.
Most places accessed PGPROC through PROC_HDR->allProcs, and can continue
to do so, except that now they get a pointer to the PGPROC (which most
places wanted anyway).
With the feature disabled, there's only a single "partition" for all
PGPROC entries.
Similarly to the buffer partitioning, this introduces a small "registry"
of partitions, as a source of truth. And then also a new "system" view
"pg_buffercache_pgproc" showing basic infromation abouut the partitions.
Note: There's an indirection, though. But the pointer does not change,
so hopefully that's not an issue. And each PGPROC entry gets an explicit
procnumber field, which is the index in allProcs, GetNumberFromPGProc
can simply return that.
Each PGPROC also gets numa_node, tracking the NUMA node, so that we
don't have to recalculate that. This is used by InitProcess() to pick
a PGPROC entry from the local NUMA node.
Note: The scheduler may migrate the process to a different CPU/node
later. Maybe we should consider pinning the process to the node?
---
.../pg_buffercache--1.6--1.7.sql | 19 +
contrib/pg_buffercache/pg_buffercache_pages.c | 94 +++
src/backend/access/transam/clog.c | 4 +-
src/backend/postmaster/pgarch.c | 2 +-
src/backend/postmaster/walsummarizer.c | 2 +-
src/backend/storage/buffer/buf_init.c | 2 -
src/backend/storage/buffer/freelist.c | 2 +-
src/backend/storage/ipc/procarray.c | 63 +-
src/backend/storage/lmgr/lock.c | 6 +-
src/backend/storage/lmgr/proc.c | 565 +++++++++++++++++-
src/backend/utils/init/globals.c | 1 +
src/backend/utils/misc/guc_tables.c | 10 +
src/include/miscadmin.h | 1 +
src/include/storage/proc.h | 14 +-
src/tools/pgindent/typedefs.list | 1 +
15 files changed, 722 insertions(+), 64 deletions(-)
diff --git a/contrib/pg_buffercache/pg_buffercache--1.6--1.7.sql b/contrib/pg_buffercache/pg_buffercache--1.6--1.7.sql
index b7d8ea45ed7..c48950a9d3b 100644
--- a/contrib/pg_buffercache/pg_buffercache--1.6--1.7.sql
+++ b/contrib/pg_buffercache/pg_buffercache--1.6--1.7.sql
@@ -23,3 +23,22 @@ REVOKE ALL ON pg_buffercache_partitions FROM PUBLIC;
GRANT EXECUTE ON FUNCTION pg_buffercache_partitions() TO pg_monitor;
GRANT SELECT ON pg_buffercache_partitions TO pg_monitor;
+
+-- Register the new functions.
+CREATE OR REPLACE FUNCTION pg_buffercache_pgproc()
+RETURNS SETOF RECORD
+AS 'MODULE_PATHNAME', 'pg_buffercache_pgproc'
+LANGUAGE C PARALLEL SAFE;
+
+-- Create a view for convenient access.
+CREATE VIEW pg_buffercache_pgproc AS
+ SELECT P.* FROM pg_buffercache_pgproc() AS P
+ (partition integer,
+ numa_node integer, num_procs integer, pgproc_ptr bigint, fastpath_ptr bigint);
+
+-- Don't want these to be available to public.
+REVOKE ALL ON FUNCTION pg_buffercache_pgproc() FROM PUBLIC;
+REVOKE ALL ON pg_buffercache_pgproc FROM PUBLIC;
+
+GRANT EXECUTE ON FUNCTION pg_buffercache_pgproc() TO pg_monitor;
+GRANT SELECT ON pg_buffercache_pgproc TO pg_monitor;
diff --git a/contrib/pg_buffercache/pg_buffercache_pages.c b/contrib/pg_buffercache/pg_buffercache_pages.c
index 5169655ae78..22396f36c09 100644
--- a/contrib/pg_buffercache/pg_buffercache_pages.c
+++ b/contrib/pg_buffercache/pg_buffercache_pages.c
@@ -15,6 +15,7 @@
#include "port/pg_numa.h"
#include "storage/buf_internals.h"
#include "storage/bufmgr.h"
+#include "storage/proc.h"
#include "utils/rel.h"
@@ -28,6 +29,7 @@
#define NUM_BUFFERCACHE_NUMA_ELEM 3
#define NUM_BUFFERCACHE_PARTITIONS_ELEM 11
+#define NUM_BUFFERCACHE_PGPROC_ELEM 5
PG_MODULE_MAGIC_EXT(
.name = "pg_buffercache",
@@ -102,6 +104,7 @@ PG_FUNCTION_INFO_V1(pg_buffercache_evict);
PG_FUNCTION_INFO_V1(pg_buffercache_evict_relation);
PG_FUNCTION_INFO_V1(pg_buffercache_evict_all);
PG_FUNCTION_INFO_V1(pg_buffercache_partitions);
+PG_FUNCTION_INFO_V1(pg_buffercache_pgproc);
/* Only need to touch memory once per backend process lifetime */
@@ -905,3 +908,94 @@ pg_buffercache_partitions(PG_FUNCTION_ARGS)
else
SRF_RETURN_DONE(funcctx);
}
+
+/*
+ * Inquire about partitioning of PGPROC array.
+ */
+Datum
+pg_buffercache_pgproc(PG_FUNCTION_ARGS)
+{
+ FuncCallContext *funcctx;
+ MemoryContext oldcontext;
+ TupleDesc tupledesc;
+ TupleDesc expected_tupledesc;
+ HeapTuple tuple;
+ Datum result;
+
+ if (SRF_IS_FIRSTCALL())
+ {
+ funcctx = SRF_FIRSTCALL_INIT();
+
+ /* Switch context when allocating stuff to be used in later calls */
+ oldcontext = MemoryContextSwitchTo(funcctx->multi_call_memory_ctx);
+
+ if (get_call_result_type(fcinfo, NULL, &expected_tupledesc) != TYPEFUNC_COMPOSITE)
+ elog(ERROR, "return type must be a row type");
+
+ if (expected_tupledesc->natts != NUM_BUFFERCACHE_PGPROC_ELEM)
+ elog(ERROR, "incorrect number of output arguments");
+
+ /* Construct a tuple descriptor for the result rows. */
+ tupledesc = CreateTemplateTupleDesc(expected_tupledesc->natts);
+ TupleDescInitEntry(tupledesc, (AttrNumber) 1, "partition",
+ INT4OID, -1, 0);
+ TupleDescInitEntry(tupledesc, (AttrNumber) 2, "numa_node",
+ INT4OID, -1, 0);
+ TupleDescInitEntry(tupledesc, (AttrNumber) 3, "num_procs",
+ INT4OID, -1, 0);
+ TupleDescInitEntry(tupledesc, (AttrNumber) 4, "pgproc_ptr",
+ INT8OID, -1, 0);
+ TupleDescInitEntry(tupledesc, (AttrNumber) 5, "fastpath_ptr",
+ INT8OID, -1, 0);
+
+ funcctx->user_fctx = BlessTupleDesc(tupledesc);
+
+ /* Return to original context when allocating transient memory */
+ MemoryContextSwitchTo(oldcontext);
+
+ /* Set max calls and remember the user function context. */
+ funcctx->max_calls = ProcPartitionCount();
+ }
+
+ funcctx = SRF_PERCALL_SETUP();
+
+ if (funcctx->call_cntr < funcctx->max_calls)
+ {
+ uint32 i = funcctx->call_cntr;
+
+ int numa_node,
+ num_procs;
+
+ void *pgproc_ptr,
+ *fastpath_ptr;
+
+ Datum values[NUM_BUFFERCACHE_PGPROC_ELEM];
+ bool nulls[NUM_BUFFERCACHE_PGPROC_ELEM];
+
+ ProcPartitionGet(i, &numa_node, &num_procs,
+ &pgproc_ptr, &fastpath_ptr);
+
+ values[0] = Int32GetDatum(i);
+ nulls[0] = false;
+
+ values[1] = Int32GetDatum(numa_node);
+ nulls[1] = false;
+
+ values[2] = Int32GetDatum(num_procs);
+ nulls[2] = false;
+
+ values[3] = PointerGetDatum(pgproc_ptr);
+ nulls[3] = false;
+
+ values[4] = PointerGetDatum(fastpath_ptr);
+ nulls[4] = false;
+
+ /* Build and return the tuple. */
+ tuple = heap_form_tuple((TupleDesc) funcctx->user_fctx, values, nulls);
+ result = HeapTupleGetDatum(tuple);
+
+ SRF_RETURN_NEXT(funcctx, result);
+ }
+ else
+ SRF_RETURN_DONE(funcctx);
+}
diff --git a/src/backend/access/transam/clog.c b/src/backend/access/transam/clog.c
index e80fbe109cf..928d126d0ee 100644
--- a/src/backend/access/transam/clog.c
+++ b/src/backend/access/transam/clog.c
@@ -574,7 +574,7 @@ TransactionGroupUpdateXidStatus(TransactionId xid, XidStatus status,
/* Walk the list and update the status of all XIDs. */
while (nextidx != INVALID_PROC_NUMBER)
{
- PGPROC *nextproc = &ProcGlobal->allProcs[nextidx];
+ PGPROC *nextproc = ProcGlobal->allProcs[nextidx];
int64 thispageno = nextproc->clogGroupMemberPage;
/*
@@ -633,7 +633,7 @@ TransactionGroupUpdateXidStatus(TransactionId xid, XidStatus status,
*/
while (wakeidx != INVALID_PROC_NUMBER)
{
- PGPROC *wakeproc = &ProcGlobal->allProcs[wakeidx];
+ PGPROC *wakeproc = ProcGlobal->allProcs[wakeidx];
wakeidx = pg_atomic_read_u32(&wakeproc->clogGroupNext);
pg_atomic_write_u32(&wakeproc->clogGroupNext, INVALID_PROC_NUMBER);
diff --git a/src/backend/postmaster/pgarch.c b/src/backend/postmaster/pgarch.c
index 78e39e5f866..e28e0f7d3bd 100644
--- a/src/backend/postmaster/pgarch.c
+++ b/src/backend/postmaster/pgarch.c
@@ -289,7 +289,7 @@ PgArchWakeup(void)
* be relaunched shortly and will start archiving.
*/
if (arch_pgprocno != INVALID_PROC_NUMBER)
- SetLatch(&ProcGlobal->allProcs[arch_pgprocno].procLatch);
+ SetLatch(&ProcGlobal->allProcs[arch_pgprocno]->procLatch);
}
diff --git a/src/backend/postmaster/walsummarizer.c b/src/backend/postmaster/walsummarizer.c
index 777c9a8d555..087279a6a8e 100644
--- a/src/backend/postmaster/walsummarizer.c
+++ b/src/backend/postmaster/walsummarizer.c
@@ -649,7 +649,7 @@ WakeupWalSummarizer(void)
LWLockRelease(WALSummarizerLock);
if (pgprocno != INVALID_PROC_NUMBER)
- SetLatch(&ProcGlobal->allProcs[pgprocno].procLatch);
+ SetLatch(&ProcGlobal->allProcs[pgprocno]->procLatch);
}
/*
diff --git a/src/backend/storage/buffer/buf_init.c b/src/backend/storage/buffer/buf_init.c
index 5b65a855b29..fb52039e1a6 100644
--- a/src/backend/storage/buffer/buf_init.c
+++ b/src/backend/storage/buffer/buf_init.c
@@ -500,8 +500,6 @@ buffer_partitions_prepare(void)
if (numa_nodes < 1)
numa_nodes = 1;
- elog(WARNING, "IsUnderPostmaster %d", IsUnderPostmaster);
-
/*
* XXX A bit weird. Do we need to worry about postmaster? Could this even
* run outside postmaster? I don't think so.
diff --git a/src/backend/storage/buffer/freelist.c b/src/backend/storage/buffer/freelist.c
index ff02dc8e00b..d8d602f0a4e 100644
--- a/src/backend/storage/buffer/freelist.c
+++ b/src/backend/storage/buffer/freelist.c
@@ -416,7 +416,7 @@ StrategyGetBuffer(BufferAccessStrategy strategy, uint32 *buf_state, bool *from_r
* actually fine because procLatch isn't ever freed, so we just can
* potentially set the wrong process' (or no process') latch.
*/
- SetLatch(&ProcGlobal->allProcs[bgwprocno].procLatch);
+ SetLatch(&ProcGlobal->allProcs[bgwprocno]->procLatch);
}
/*
diff --git a/src/backend/storage/ipc/procarray.c b/src/backend/storage/ipc/procarray.c
index bf987aed8d3..3e86e4ca2ae 100644
--- a/src/backend/storage/ipc/procarray.c
+++ b/src/backend/storage/ipc/procarray.c
@@ -268,7 +268,7 @@ typedef enum KAXCompressReason
static ProcArrayStruct *procArray;
-static PGPROC *allProcs;
+static PGPROC **allProcs;
/*
* Cache to reduce overhead of repeated calls to TransactionIdIsInProgress()
@@ -502,7 +502,7 @@ ProcArrayAdd(PGPROC *proc)
int this_procno = arrayP->pgprocnos[index];
Assert(this_procno >= 0 && this_procno < (arrayP->maxProcs + NUM_AUXILIARY_PROCS));
- Assert(allProcs[this_procno].pgxactoff == index);
+ Assert(allProcs[this_procno]->pgxactoff == index);
/* If we have found our right position in the array, break */
if (this_procno > pgprocno)
@@ -538,9 +538,9 @@ ProcArrayAdd(PGPROC *proc)
int procno = arrayP->pgprocnos[index];
Assert(procno >= 0 && procno < (arrayP->maxProcs + NUM_AUXILIARY_PROCS));
- Assert(allProcs[procno].pgxactoff == index - 1);
+ Assert(allProcs[procno]->pgxactoff == index - 1);
- allProcs[procno].pgxactoff = index;
+ allProcs[procno]->pgxactoff = index;
}
/*
@@ -581,7 +581,7 @@ ProcArrayRemove(PGPROC *proc, TransactionId latestXid)
myoff = proc->pgxactoff;
Assert(myoff >= 0 && myoff < arrayP->numProcs);
- Assert(ProcGlobal->allProcs[arrayP->pgprocnos[myoff]].pgxactoff == myoff);
+ Assert(ProcGlobal->allProcs[arrayP->pgprocnos[myoff]]->pgxactoff == myoff);
if (TransactionIdIsValid(latestXid))
{
@@ -636,9 +636,9 @@ ProcArrayRemove(PGPROC *proc, TransactionId latestXid)
int procno = arrayP->pgprocnos[index];
Assert(procno >= 0 && procno < (arrayP->maxProcs + NUM_AUXILIARY_PROCS));
- Assert(allProcs[procno].pgxactoff - 1 == index);
+ Assert(allProcs[procno]->pgxactoff - 1 == index);
- allProcs[procno].pgxactoff = index;
+ allProcs[procno]->pgxactoff = index;
}
/*
@@ -860,7 +860,7 @@ ProcArrayGroupClearXid(PGPROC *proc, TransactionId latestXid)
/* Walk the list and clear all XIDs. */
while (nextidx != INVALID_PROC_NUMBER)
{
- PGPROC *nextproc = &allProcs[nextidx];
+ PGPROC *nextproc = allProcs[nextidx];
ProcArrayEndTransactionInternal(nextproc, nextproc->procArrayGroupMemberXid);
@@ -880,7 +880,7 @@ ProcArrayGroupClearXid(PGPROC *proc, TransactionId latestXid)
*/
while (wakeidx != INVALID_PROC_NUMBER)
{
- PGPROC *nextproc = &allProcs[wakeidx];
+ PGPROC *nextproc = allProcs[wakeidx];
wakeidx = pg_atomic_read_u32(&nextproc->procArrayGroupNext);
pg_atomic_write_u32(&nextproc->procArrayGroupNext, INVALID_PROC_NUMBER);
@@ -1526,7 +1526,7 @@ TransactionIdIsInProgress(TransactionId xid)
pxids = other_subxidstates[pgxactoff].count;
pg_read_barrier(); /* pairs with barrier in GetNewTransactionId() */
pgprocno = arrayP->pgprocnos[pgxactoff];
- proc = &allProcs[pgprocno];
+ proc = allProcs[pgprocno];
for (j = pxids - 1; j >= 0; j--)
{
/* Fetch xid just once - see GetNewTransactionId */
@@ -1622,7 +1622,6 @@ TransactionIdIsInProgress(TransactionId xid)
return false;
}
-
/*
* Determine XID horizons.
*
@@ -1740,7 +1739,7 @@ ComputeXidHorizons(ComputeXidHorizonsResult *h)
for (int index = 0; index < arrayP->numProcs; index++)
{
int pgprocno = arrayP->pgprocnos[index];
- PGPROC *proc = &allProcs[pgprocno];
+ PGPROC *proc = allProcs[pgprocno];
int8 statusFlags = ProcGlobal->statusFlags[index];
TransactionId xid;
TransactionId xmin;
@@ -2224,7 +2223,7 @@ GetSnapshotData(Snapshot snapshot)
TransactionId xid = UINT32_ACCESS_ONCE(other_xids[pgxactoff]);
uint8 statusFlags;
- Assert(allProcs[arrayP->pgprocnos[pgxactoff]].pgxactoff == pgxactoff);
+ Assert(allProcs[arrayP->pgprocnos[pgxactoff]]->pgxactoff == pgxactoff);
/*
* If the transaction has no XID assigned, we can skip it; it
@@ -2298,7 +2297,7 @@ GetSnapshotData(Snapshot snapshot)
if (nsubxids > 0)
{
int pgprocno = pgprocnos[pgxactoff];
- PGPROC *proc = &allProcs[pgprocno];
+ PGPROC *proc = allProcs[pgprocno];
pg_read_barrier(); /* pairs with GetNewTransactionId */
@@ -2499,7 +2498,7 @@ ProcArrayInstallImportedXmin(TransactionId xmin,
for (index = 0; index < arrayP->numProcs; index++)
{
int pgprocno = arrayP->pgprocnos[index];
- PGPROC *proc = &allProcs[pgprocno];
+ PGPROC *proc = allProcs[pgprocno];
int statusFlags = ProcGlobal->statusFlags[index];
TransactionId xid;
@@ -2725,7 +2724,7 @@ GetRunningTransactionData(void)
if (TransactionIdPrecedes(xid, oldestDatabaseRunningXid))
{
int pgprocno = arrayP->pgprocnos[index];
- PGPROC *proc = &allProcs[pgprocno];
+ PGPROC *proc = allProcs[pgprocno];
if (proc->databaseId == MyDatabaseId)
oldestDatabaseRunningXid = xid;
@@ -2756,7 +2755,7 @@ GetRunningTransactionData(void)
for (index = 0; index < arrayP->numProcs; index++)
{
int pgprocno = arrayP->pgprocnos[index];
- PGPROC *proc = &allProcs[pgprocno];
+ PGPROC *proc = allProcs[pgprocno];
int nsubxids;
/*
@@ -2858,7 +2857,7 @@ GetOldestActiveTransactionId(bool inCommitOnly, bool allDbs)
{
TransactionId xid;
int pgprocno = arrayP->pgprocnos[index];
- PGPROC *proc = &allProcs[pgprocno];
+ PGPROC *proc = allProcs[pgprocno];
/* Fetch xid just once - see GetNewTransactionId */
xid = UINT32_ACCESS_ONCE(other_xids[index]);
@@ -3020,7 +3019,7 @@ GetVirtualXIDsDelayingChkpt(int *nvxids, int type)
for (index = 0; index < arrayP->numProcs; index++)
{
int pgprocno = arrayP->pgprocnos[index];
- PGPROC *proc = &allProcs[pgprocno];
+ PGPROC *proc = allProcs[pgprocno];
if ((proc->delayChkptFlags & type) != 0)
{
@@ -3061,7 +3060,7 @@ HaveVirtualXIDsDelayingChkpt(VirtualTransactionId *vxids, int nvxids, int type)
for (index = 0; index < arrayP->numProcs; index++)
{
int pgprocno = arrayP->pgprocnos[index];
- PGPROC *proc = &allProcs[pgprocno];
+ PGPROC *proc = allProcs[pgprocno];
VirtualTransactionId vxid;
GET_VXID_FROM_PGPROC(vxid, *proc);
@@ -3189,7 +3188,7 @@ BackendPidGetProcWithLock(int pid)
for (index = 0; index < arrayP->numProcs; index++)
{
- PGPROC *proc = &allProcs[arrayP->pgprocnos[index]];
+ PGPROC *proc = allProcs[arrayP->pgprocnos[index]];
if (proc->pid == pid)
{
@@ -3232,7 +3231,7 @@ BackendXidGetPid(TransactionId xid)
if (other_xids[index] == xid)
{
int pgprocno = arrayP->pgprocnos[index];
- PGPROC *proc = &allProcs[pgprocno];
+ PGPROC *proc = allProcs[pgprocno];
result = proc->pid;
break;
@@ -3301,7 +3300,7 @@ GetCurrentVirtualXIDs(TransactionId limitXmin, bool excludeXmin0,
for (index = 0; index < arrayP->numProcs; index++)
{
int pgprocno = arrayP->pgprocnos[index];
- PGPROC *proc = &allProcs[pgprocno];
+ PGPROC *proc = allProcs[pgprocno];
uint8 statusFlags = ProcGlobal->statusFlags[index];
if (proc == MyProc)
@@ -3403,7 +3402,7 @@ GetConflictingVirtualXIDs(TransactionId limitXmin, Oid dbOid)
for (index = 0; index < arrayP->numProcs; index++)
{
int pgprocno = arrayP->pgprocnos[index];
- PGPROC *proc = &allProcs[pgprocno];
+ PGPROC *proc = allProcs[pgprocno];
/* Exclude prepared transactions */
if (proc->pid == 0)
@@ -3468,7 +3467,7 @@ SignalVirtualTransaction(VirtualTransactionId vxid, ProcSignalReason sigmode,
for (index = 0; index < arrayP->numProcs; index++)
{
int pgprocno = arrayP->pgprocnos[index];
- PGPROC *proc = &allProcs[pgprocno];
+ PGPROC *proc = allProcs[pgprocno];
VirtualTransactionId procvxid;
GET_VXID_FROM_PGPROC(procvxid, *proc);
@@ -3523,7 +3522,7 @@ MinimumActiveBackends(int min)
for (index = 0; index < arrayP->numProcs; index++)
{
int pgprocno = arrayP->pgprocnos[index];
- PGPROC *proc = &allProcs[pgprocno];
+ PGPROC *proc = allProcs[pgprocno];
/*
* Since we're not holding a lock, need to be prepared to deal with
@@ -3569,7 +3568,7 @@ CountDBBackends(Oid databaseid)
for (index = 0; index < arrayP->numProcs; index++)
{
int pgprocno = arrayP->pgprocnos[index];
- PGPROC *proc = &allProcs[pgprocno];
+ PGPROC *proc = allProcs[pgprocno];
if (proc->pid == 0)
continue; /* do not count prepared xacts */
@@ -3598,7 +3597,7 @@ CountDBConnections(Oid databaseid)
for (index = 0; index < arrayP->numProcs; index++)
{
int pgprocno = arrayP->pgprocnos[index];
- PGPROC *proc = &allProcs[pgprocno];
+ PGPROC *proc = allProcs[pgprocno];
if (proc->pid == 0)
continue; /* do not count prepared xacts */
@@ -3629,7 +3628,7 @@ CancelDBBackends(Oid databaseid, ProcSignalReason sigmode, bool conflictPending)
for (index = 0; index < arrayP->numProcs; index++)
{
int pgprocno = arrayP->pgprocnos[index];
- PGPROC *proc = &allProcs[pgprocno];
+ PGPROC *proc = allProcs[pgprocno];
if (databaseid == InvalidOid || proc->databaseId == databaseid)
{
@@ -3670,7 +3669,7 @@ CountUserBackends(Oid roleid)
for (index = 0; index < arrayP->numProcs; index++)
{
int pgprocno = arrayP->pgprocnos[index];
- PGPROC *proc = &allProcs[pgprocno];
+ PGPROC *proc = allProcs[pgprocno];
if (proc->pid == 0)
continue; /* do not count prepared xacts */
@@ -3733,7 +3732,7 @@ CountOtherDBBackends(Oid databaseId, int *nbackends, int *nprepared)
for (index = 0; index < arrayP->numProcs; index++)
{
int pgprocno = arrayP->pgprocnos[index];
- PGPROC *proc = &allProcs[pgprocno];
+ PGPROC *proc = allProcs[pgprocno];
uint8 statusFlags = ProcGlobal->statusFlags[index];
if (proc->databaseId != databaseId)
@@ -3799,7 +3798,7 @@ TerminateOtherDBBackends(Oid databaseId)
for (i = 0; i < procArray->numProcs; i++)
{
int pgprocno = arrayP->pgprocnos[i];
- PGPROC *proc = &allProcs[pgprocno];
+ PGPROC *proc = allProcs[pgprocno];
if (proc->databaseId != databaseId)
continue;
diff --git a/src/backend/storage/lmgr/lock.c b/src/backend/storage/lmgr/lock.c
index 62f3471448e..c84a2a5f1bc 100644
--- a/src/backend/storage/lmgr/lock.c
+++ b/src/backend/storage/lmgr/lock.c
@@ -2844,7 +2844,7 @@ FastPathTransferRelationLocks(LockMethod lockMethodTable, const LOCKTAG *locktag
*/
for (i = 0; i < ProcGlobal->allProcCount; i++)
{
- PGPROC *proc = &ProcGlobal->allProcs[i];
+ PGPROC *proc = ProcGlobal->allProcs[i];
uint32 j;
LWLockAcquire(&proc->fpInfoLock, LW_EXCLUSIVE);
@@ -3103,7 +3103,7 @@ GetLockConflicts(const LOCKTAG *locktag, LOCKMODE lockmode, int *countp)
*/
for (i = 0; i < ProcGlobal->allProcCount; i++)
{
- PGPROC *proc = &ProcGlobal->allProcs[i];
+ PGPROC *proc = ProcGlobal->allProcs[i];
uint32 j;
/* A backend never blocks itself */
@@ -3790,7 +3790,7 @@ GetLockStatusData(void)
*/
for (i = 0; i < ProcGlobal->allProcCount; ++i)
{
- PGPROC *proc = &ProcGlobal->allProcs[i];
+ PGPROC *proc = ProcGlobal->allProcs[i];
/* Skip backends with pid=0, as they don't hold fast-path locks */
if (proc->pid == 0)
diff --git a/src/backend/storage/lmgr/proc.c b/src/backend/storage/lmgr/proc.c
index e9ef0fbfe32..11259151a7d 100644
--- a/src/backend/storage/lmgr/proc.c
+++ b/src/backend/storage/lmgr/proc.c
@@ -29,21 +29,29 @@
*/
#include "postgres.h"
+#include <sched.h>
#include <signal.h>
#include <unistd.h>
#include <sys/time.h>
+#ifdef USE_LIBNUMA
+#include <numa.h>
+#include <numaif.h>
+#endif
+
#include "access/transam.h"
#include "access/twophase.h"
#include "access/xlogutils.h"
#include "miscadmin.h"
#include "pgstat.h"
+#include "port/pg_numa.h"
#include "postmaster/autovacuum.h"
#include "replication/slotsync.h"
#include "replication/syncrep.h"
#include "storage/condition_variable.h"
#include "storage/ipc.h"
#include "storage/lmgr.h"
+#include "storage/pg_shmem.h"
#include "storage/pmsignal.h"
#include "storage/proc.h"
#include "storage/procarray.h"
@@ -90,6 +98,31 @@ static void AuxiliaryProcKill(int code, Datum arg);
static void CheckDeadLock(void);
+/* number of NUMA nodes (as returned by numa_num_configured_nodes) */
+static int numa_nodes = -1; /* number of nodes when sizing */
+static Size numa_page_size = 0; /* page used to size partitions */
+static bool numa_can_partition = false; /* can map to NUMA nodes? */
+static int numa_procs_per_node = -1; /* pgprocs per node */
+
+static Size get_memory_page_size(void); /* XXX duplicate with buf_init.c */
+
+static void pgproc_partitions_prepare(void);
+static char *pgproc_partition_init(char *ptr, int num_procs,
+ int allprocs_index, int node);
+static char *fastpath_partition_init(char *ptr, int num_procs,
+ int allprocs_index, int node,
+ Size fpLockBitsSize, Size fpRelIdSize);
+
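+/*
+ * PGProcPartition describes one PGPROC partition: which NUMA node it is
+ * bound to (-1 for the auxiliary/2PC partition), how many PGPROC entries
+ * it holds, and where its PGPROC and fast-path chunks start in shared
+ * memory.
+ */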
+typedef struct PGProcPartition
+{
+ int num_procs;
+ int numa_node;
+ void *pgproc_ptr;
+ void *fastpath_ptr;
+} PGProcPartition;
+
+static PGProcPartition *partitions = NULL;
+
/*
* Report shared-memory space needed by PGPROC.
*/
@@ -100,11 +133,63 @@ PGProcShmemSize(void)
Size TotalProcs =
add_size(MaxBackends, add_size(NUM_AUXILIARY_PROCS, max_prepared_xacts));
+ size = add_size(size, mul_size(TotalProcs, sizeof(PGPROC *)));
size = add_size(size, mul_size(TotalProcs, sizeof(PGPROC)));
size = add_size(size, mul_size(TotalProcs, sizeof(*ProcGlobal->xids)));
size = add_size(size, mul_size(TotalProcs, sizeof(*ProcGlobal->subxidStates)));
size = add_size(size, mul_size(TotalProcs, sizeof(*ProcGlobal->statusFlags)));
+ /*
+ * To support NUMA partitioning, the PGPROC array will be divided into
+ * multiple chunks - one per NUMA node, and one extra for auxiliary/2PC
+ * entries (which are not assigned to any NUMA node).
+ *
+ * We can't simply map pages of a single contiguous array, because the
+ * PGPROC entries are very small and too many of them would fit on a
+ * single page (at least with huge pages) - far more than any reasonable
+ * value of max_connections. So instead we cut the array into separate
+ * pieces for each node.
+ *
+ * Each piece may need up to one memory page of padding, to align it to a
+ * memory page boundary (for NUMA), so we just add a page - it's a bit
+ * wasteful, but should not matter much - NUMA is meant for large boxes,
+ * so a couple pages is negligible.
+ *
+ * We only do this with NUMA partitioning. With the GUC disabled, or when
+ * we find we can't do that for some reason, we just allocate the PGPROC
+ * array as a single chunk. This is determined by the earlier call to
+ * pgproc_partitions_prepare().
+ *
+ * XXX It might be more painful with very large huge pages (e.g. 1GB).
+ */
+
+ /*
+ * If PGPROC partitioning is enabled, and we decided it's possible, we
+ * need to add one memory page per NUMA node (and one for auxiliary/2PC
+ * processes) to allow proper alignment.
+ *
+ * XXX This is a bit wasteful, because it might actually add pages even
+ * when not strictly needed (if it's already aligned). And we always
+ * assume we'll add a whole page, even if the alignment needs only less
+ * memory.
+ */
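+ /*
+ * For example, with 4 NUMA nodes and 2MB huge pages this adds 5 extra
+ * memory pages (10MB) for alignment, plus 5 PGProcPartition entries for
+ * the registry.
+ */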
+ if (numa_procs_interleave && numa_can_partition)
+ {
+ Assert(numa_nodes > 0);
+ size = add_size(size, mul_size((numa_nodes + 1), numa_page_size));
+
+ /*
+ * Also account for a small registry of partitions, a simple array of
+ * partitions at the beginning.
+ */
+ size = add_size(size, mul_size((numa_nodes + 1), sizeof(PGProcPartition)));
+ }
+ else
+ {
+ /* otherwise add only a tiny registry, with a single partition */
+ size = add_size(size, sizeof(PGProcPartition));
+ }
+
return size;
}
@@ -129,6 +214,25 @@ FastPathLockShmemSize(void)
size = add_size(size, mul_size(TotalProcs, (fpLockBitsSize + fpRelIdSize)));
+ /*
+ * When applying NUMA to the fast-path locks, we follow the same logic as
+ * for PGPROC entries. See the comments in PGProcShmemSize().
+ *
+ * If PGPROC partitioning is enabled, and we decided it's possible, we
+ * need to add one memory page per NUMA node (and one for auxiliary/2PC
+ * processes) to allow proper alignment.
+ *
+ * XXX This is a bit wasteful, because it might actually add pages even
+ * when not strictly needed (if it's already aligned). And we always
+ * assume we'll add a whole page, even if the alignment needs only less
+ * memory.
+ */
+ if (numa_procs_interleave && numa_can_partition)
+ {
+ Assert(numa_nodes > 0);
+ size = add_size(size, mul_size((numa_nodes + 1), numa_page_size));
+ }
+
return size;
}
@@ -140,6 +244,9 @@ ProcGlobalShmemSize(void)
{
Size size = 0;
+ /* calculate partition info for pgproc entries etc */
+ pgproc_partitions_prepare();
+
/* ProcGlobal */
size = add_size(size, sizeof(PROC_HDR));
size = add_size(size, sizeof(slock_t));
@@ -191,7 +298,7 @@ ProcGlobalSemas(void)
void
InitProcGlobal(void)
{
- PGPROC *procs;
+ PGPROC **procs;
int i,
j;
bool found;
@@ -205,6 +312,8 @@ InitProcGlobal(void)
Size requestSize;
char *ptr;
+ Size mem_page_size = get_memory_page_size();
+
/* Create the ProcGlobal shared structure */
ProcGlobal = (PROC_HDR *)
ShmemInitStruct("Proc Header", sizeof(PROC_HDR), &found);
@@ -241,19 +350,115 @@ InitProcGlobal(void)
MemSet(ptr, 0, requestSize);
- procs = (PGPROC *) ptr;
- ptr = (char *) ptr + TotalProcs * sizeof(PGPROC);
+ /* allprocs (array of pointers to PGPROC entries) */
+ procs = (PGPROC **) ptr;
+ ptr = (char *) ptr + TotalProcs * sizeof(PGPROC *);
ProcGlobal->allProcs = procs;
/* XXX allProcCount isn't really all of them; it excludes prepared xacts */
ProcGlobal->allProcCount = MaxBackends + NUM_AUXILIARY_PROCS;
+ /*
+ * If NUMA partitioning is enabled, and we decided we actually can do the
+ * partitioning, allocate the chunks.
+ *
+ * Otherwise we'll allocate a single array for everything. It's not quite
+ * what we did without NUMA, because there's an extra level of
+ * indirection, but it's the best we can do.
+ */
+ if (numa_procs_interleave && numa_can_partition)
+ {
+ int node_procs;
+ int total_procs = 0;
+
+ Assert(numa_procs_per_node > 0);
+ Assert(numa_nodes > 0);
+
+ /*
+ * Now initialize the PGPROC partition registry, with one partition
+ * per NUMA node plus one for auxiliary procs / prepared xacts.
+ */
+ partitions = (PGProcPartition *) ptr;
+ ptr += ((numa_nodes + 1) * sizeof(PGProcPartition));
+
+ /* build PGPROC entries for NUMA nodes */
+ for (i = 0; i < numa_nodes; i++)
+ {
+ /* the last NUMA node may get fewer PGPROC entries, but meh */
+ node_procs = Min(numa_procs_per_node, MaxBackends - total_procs);
+
+ /* make sure to align the PGPROC array to memory page */
+ ptr = (char *) TYPEALIGN(numa_page_size, ptr);
+
+ /* fill in the partition info */
+ partitions[i].num_procs = node_procs;
+ partitions[i].numa_node = i;
+ partitions[i].pgproc_ptr = ptr;
+
+ ptr = pgproc_partition_init(ptr, node_procs, total_procs, i);
+
+ total_procs += node_procs;
+
+ /* don't underflow/overflow the allocation */
+ Assert((ptr > (char *) procs) && (ptr <= (char *) procs + requestSize));
+ }
+
+ Assert(total_procs == MaxBackends);
+
+ /*
+ * Also build PGPROC entries for auxiliary procs / prepared xacts (we
+ * however don't assign those to any NUMA node).
+ */
+ node_procs = (NUM_AUXILIARY_PROCS + max_prepared_xacts);
+
+ /* make sure to align the PGPROC array to memory page */
+ ptr = (char *) TYPEALIGN(numa_page_size, ptr);
+
+ /* fill in the partition info */
+ partitions[numa_nodes].num_procs = node_procs;
+ partitions[numa_nodes].numa_node = -1;
+ partitions[numa_nodes].pgproc_ptr = ptr;
+
+ ptr = pgproc_partition_init(ptr, node_procs, total_procs, -1);
+
+ total_procs += node_procs;
+
+ /* don't overflow the allocation */
+ Assert((ptr > (char *) procs) && (ptr <= (char *) procs + requestSize));
+
+ Assert(total_procs == TotalProcs);
+ }
+ else
+ {
+ /*
+ * Now initialize the PGPROC partition registry with a single
+ * partition for all the procs.
+ */
+ partitions = (PGProcPartition *) ptr;
+ ptr += sizeof(PGProcPartition);
+
+ /* fill in the partition info (ptr still points at the start) */
+ partitions[0].num_procs = TotalProcs;
+ partitions[0].numa_node = -1;
+ partitions[0].pgproc_ptr = ptr;
+
+ /* just treat everything as a single array, with no alignment */
+ ptr = pgproc_partition_init(ptr, TotalProcs, 0, -1);
+
+ /* don't overflow the allocation */
+ Assert((ptr > (char *) procs) && (ptr <= (char *) procs + requestSize));
+ }
+
/*
* Allocate arrays mirroring PGPROC fields in a dense manner. See
* PROC_HDR.
*
* XXX: It might make sense to increase padding for these arrays, given
* how hotly they are accessed.
+ *
+ * XXX Would it make sense to NUMA-partition these chunks too, somehow?
+ * But those arrays are tiny, fit into a single memory page, so would need
+ * to be made more complex. Not sure.
*/
ProcGlobal->xids = (TransactionId *) ptr;
ptr = (char *) ptr + (TotalProcs * sizeof(*ProcGlobal->xids));
@@ -286,24 +491,92 @@ InitProcGlobal(void)
/* For asserts checking we did not overflow. */
fpEndPtr = fpPtr + requestSize;
- for (i = 0; i < TotalProcs; i++)
+ /*
+ * Mimic the logic we used to partition PGPROC entries.
+ */
+
+ /*
+ * If NUMA partitioning is enabled, and we decided we actually can do the
+ * partitioning, allocate the chunks.
+ *
+ * Otherwise we'll allocate a single array for everything. It's not quite
+ * what we did without NUMA, because there's an extra level of
+ * indirection, but it's the best we can do.
+ */
+ if (numa_procs_interleave && numa_can_partition)
{
- PGPROC *proc = &procs[i];
+ int node_procs;
+ int total_procs = 0;
+
+ Assert(numa_procs_per_node > 0);
+
+ /* build PGPROC entries for NUMA nodes */
+ for (i = 0; i < numa_nodes; i++)
+ {
+ /* the last NUMA node may get fewer PGPROC entries, but meh */
+ node_procs = Min(numa_procs_per_node, MaxBackends - total_procs);
+
+ /* make sure to align the PGPROC array to memory page */
+ fpPtr = (char *) TYPEALIGN(mem_page_size, fpPtr);
- /* Common initialization for all PGPROCs, regardless of type. */
+ /* remember this pointer too */
+ partitions[i].fastpath_ptr = fpPtr;
+ Assert(node_procs == partitions[i].num_procs);
+
+ fpPtr = fastpath_partition_init(fpPtr, node_procs, total_procs, i,
+ fpLockBitsSize, fpRelIdSize);
+
+ total_procs += node_procs;
+
+ /* don't overflow the allocation */
+ Assert(fpPtr <= fpEndPtr);
+ }
+
+ Assert(total_procs == MaxBackends);
/*
- * Set the fast-path lock arrays, and move the pointer. We interleave
- * the two arrays, to (hopefully) get some locality for each backend.
+ * Also build PGPROC entries for auxiliary procs / prepared xacts (we
+ * however don't assign those to any NUMA node).
*/
- proc->fpLockBits = (uint64 *) fpPtr;
- fpPtr += fpLockBitsSize;
+ node_procs = (NUM_AUXILIARY_PROCS + max_prepared_xacts);
- proc->fpRelId = (Oid *) fpPtr;
- fpPtr += fpRelIdSize;
+ /* make sure to align the PGPROC array to memory page */
+ fpPtr = (char *) TYPEALIGN(numa_page_size, fpPtr);
+ /* remember this pointer too */
+ partitions[numa_nodes].fastpath_ptr = fpPtr;
+ Assert(node_procs == partitions[numa_nodes].num_procs);
+
+ fpPtr = fastpath_partition_init(fpPtr, node_procs, total_procs, -1,
+ fpLockBitsSize, fpRelIdSize);
+
+ total_procs += node_procs;
+
+ /* don't overflow the allocation */
Assert(fpPtr <= fpEndPtr);
+ Assert(total_procs == TotalProcs);
+ }
+ else
+ {
+ /* remember this pointer too */
+ partitions[0].fastpath_ptr = fpPtr;
+ Assert(TotalProcs == partitions[0].num_procs);
+
+ /* just treat everything as a single array, with no alignment */
+ fpPtr = fastpath_partition_init(fpPtr, TotalProcs, 0, -1,
+ fpLockBitsSize, fpRelIdSize);
+
+ /* don't overflow the allocation */
+ Assert(fpPtr <= fpEndPtr);
+ }
+
+ for (i = 0; i < TotalProcs; i++)
+ {
+ PGPROC *proc = procs[i];
+
+ Assert(proc->procnumber == i);
+
/*
* Set up per-PGPROC semaphore, latch, and fpInfoLock. Prepared xact
* dummy PGPROCs don't need these though - they're never associated
@@ -366,15 +639,12 @@ InitProcGlobal(void)
pg_atomic_init_u64(&(proc->waitStart), 0);
}
- /* Should have consumed exactly the expected amount of fast-path memory. */
- Assert(fpPtr == fpEndPtr);
-
/*
* Save pointers to the blocks of PGPROC structures reserved for auxiliary
* processes and prepared transactions.
*/
- AuxiliaryProcs = &procs[MaxBackends];
- PreparedXactProcs = &procs[MaxBackends + NUM_AUXILIARY_PROCS];
+ AuxiliaryProcs = procs[MaxBackends];
+ PreparedXactProcs = procs[MaxBackends + NUM_AUXILIARY_PROCS];
/* Create ProcStructLock spinlock, too */
ProcStructLock = (slock_t *) ShmemInitStruct("ProcStructLock spinlock",
@@ -435,7 +705,45 @@ InitProcess(void)
if (!dlist_is_empty(procgloballist))
{
- MyProc = dlist_container(PGPROC, links, dlist_pop_head_node(procgloballist));
+ /*
+ * With numa interleaving of PGPROC, try to get a PROC entry from the
+ * right NUMA node (when the process starts).
+ *
+ * XXX The process may move to a different NUMA node later, but
+ * there's not much we can do about that.
+ */
+ if (numa_procs_interleave)
+ {
+ dlist_mutable_iter iter;
+ unsigned cpu;
+ unsigned node;
+ int rc;
+
+ rc = getcpu(&cpu, &node);
+ if (rc != 0)
+ elog(ERROR, "getcpu failed: %m");
+
+ MyProc = NULL;
+
+ dlist_foreach_modify(iter, procgloballist)
+ {
+ PGPROC *proc;
+
+ proc = dlist_container(PGPROC, links, iter.cur);
+
+ if (proc->numa_node == node)
+ {
+ MyProc = proc;
+ dlist_delete(iter.cur);
+ break;
+ }
+ }
+ }
+
+ /* didn't find PGPROC from the correct NUMA node, pick any free one */
+ if (MyProc == NULL)
+ MyProc = dlist_container(PGPROC, links, dlist_pop_head_node(procgloballist));
+
SpinLockRelease(ProcStructLock);
}
else
@@ -1988,7 +2296,7 @@ ProcSendSignal(ProcNumber procNumber)
if (procNumber < 0 || procNumber >= ProcGlobal->allProcCount)
elog(ERROR, "procNumber out of range");
- SetLatch(&ProcGlobal->allProcs[procNumber].procLatch);
+ SetLatch(&ProcGlobal->allProcs[procNumber]->procLatch);
}
/*
@@ -2063,3 +2371,222 @@ BecomeLockGroupMember(PGPROC *leader, int pid)
return ok;
}
+
+/* copy from buf_init.c */
+static Size
+get_memory_page_size(void)
+{
+ Size os_page_size;
+ Size huge_page_size;
+
+#ifdef WIN32
+ SYSTEM_INFO sysinfo;
+
+ GetSystemInfo(&sysinfo);
+ os_page_size = sysinfo.dwPageSize;
+#else
+ os_page_size = sysconf(_SC_PAGESIZE);
+#endif
+
+ /*
+ * XXX This is a bit annoying/confusing, because we may get a different
+ * result depending on when we call it. Before mmap() we don't know if the
+ * huge pages get used, so we assume they will. And then if we don't get
+ * huge pages, we'll waste memory etc.
+ */
+
+ /* assume huge pages get used, unless HUGE_PAGES_OFF */
+ if (huge_pages_status == HUGE_PAGES_OFF)
+ huge_page_size = 0;
+ else
+ GetHugePageSize(&huge_page_size, NULL);
+
+ return Max(os_page_size, huge_page_size);
+}
+
+/*
+ * pgproc_partitions_prepare
+ *		Calculate parameters for partitioning the PGPROC array.
+ *
+ * The PGPROC array is split into one "chunk" per NUMA node (plus one extra
+ * chunk for auxiliary processes and 2PC transactions, not associated with
+ * any particular node).
+ *
+ * Determine how many "backend" procs to allocate per NUMA node. The count
+ * may not divide evenly, but we mostly ignore that. The last node may get
+ * somewhat fewer PGPROC entries, but the imbalance ought to be pretty
+ * small (if MaxBackends >> numa_nodes).
+ *
+ * XXX A fairer distribution is possible, but not worth it for now.
+ */
+static void
+pgproc_partitions_prepare(void)
+{
+ /* bail out if already initialized (calculate only once) */
+ if (numa_nodes != -1)
+ return;
+
+ /* XXX only gives us the number, the nodes may not be 0, 1, 2, ... */
+ numa_nodes = numa_num_configured_nodes();
+
+ /* XXX can this happen? */
+ if (numa_nodes < 1)
+ numa_nodes = 1;
+
+ /*
+ * XXX A bit weird. Do we need to worry about postmaster? Could this even
+ * run outside postmaster? I don't think so.
+ *
+ * XXX Another issue is we may get different values than when sizing the
+ * memory, because at that point we didn't know if we get huge pages,
+ * so we assumed we will. Shouldn't cause crashes, but we might allocate
+ * shared memory and then not use some of it (because of the alignment
+ * that we don't actually need). Not sure about better way, good for now.
+ */
+ if (IsUnderPostmaster)
+ numa_page_size = pg_get_shmem_pagesize();
+ else
+ numa_page_size = get_memory_page_size();
+
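+ /* ceiling division, e.g. MaxBackends = 100 on 3 nodes gives 34 per node */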
+ numa_procs_per_node = (MaxBackends + (numa_nodes - 1)) / numa_nodes;
+
+ elog(LOG, "NUMA: pgproc backends %d num_nodes %d per_node %d",
+ MaxBackends, numa_nodes, numa_procs_per_node);
+
+ Assert(numa_nodes * numa_procs_per_node >= MaxBackends);
+
+ /* success */
+ numa_can_partition = true;
+}
+
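+/*
+ * pg_numa_move_to_node
+ *		Bind the memory range [startptr, endptr) to the given NUMA node.
+ *
+ * Uses numa_tonode_memory(), and expects startptr to be aligned to a memory
+ * page. Callers do this before touching the memory, so the pages get
+ * faulted in on the requested node.
+ */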
+static void
+pg_numa_move_to_node(char *startptr, char *endptr, int node)
+{
+ Size mem_page_size;
+ Size sz;
+
+ /*
+ * Get the "actual" memory page size, not the one we used for sizing. We
+ * might have used huge page for sizing, but only get regular pages when
+ * allocating, so we must use the smaller pages here.
+ *
+ * XXX A bit weird. Do we need to worry about postmaster? Could this even
+ * run outside postmaster? I don't think so.
+ */
+ if (IsUnderPostmaster)
+ mem_page_size = pg_get_shmem_pagesize();
+ else
+ mem_page_size = get_memory_page_size();
+
+ Assert((int64) startptr % mem_page_size == 0);
+
+ sz = (endptr - startptr);
+ numa_tonode_memory(startptr, sz, node);
+}
+
+/*
+ * pgproc_partition_init
+ *		Initialize one partition of the PGPROC array, starting at ptr.
+ *
+ * Binds the chunk to the given NUMA node (unless node is -1), sets the
+ * numa_node and procnumber fields of each entry, and fills in the matching
+ * ProcGlobal->allProcs pointers. Does not do any alignment - the caller is
+ * expected to page-align ptr first.
+ */
+static char *
+pgproc_partition_init(char *ptr, int num_procs, int allprocs_index, int node)
+{
+ PGPROC *procs_node;
+
+ /* allocate the PGPROC chunk for this node */
+ procs_node = (PGPROC *) ptr;
+
+ /* pointer right after this array */
+ ptr = (char *) ptr + num_procs * sizeof(PGPROC);
+
+ elog(LOG, "NUMA: pgproc_init_partition procs %p endptr %p num_procs %d node %d",
+ procs_node, ptr, num_procs, node);
+
+ /*
+ * if node specified, move to node - do this before we start touching the
+ * memory, to make sure it's not mapped to any node yet
+ */
+ if (node != -1)
+ pg_numa_move_to_node((char *) procs_node, ptr, node);
+
+ /* add pointers to the PGPROC entries to allProcs */
+ for (int i = 0; i < num_procs; i++)
+ {
+ procs_node[i].numa_node = node;
+ procs_node[i].procnumber = allprocs_index;
+
+ ProcGlobal->allProcs[allprocs_index] = &procs_node[i];
+
+ allprocs_index++;
+ }
+
+ return ptr;
+}
+
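+/*
+ * fastpath_partition_init
+ *		Initialize the fast-path lock arrays for one PGPROC partition.
+ *
+ * Binds the memory range to the given NUMA node (unless node is -1) and
+ * points each PGPROC entry of the partition at its fpLockBits and fpRelId
+ * arrays, interleaved per backend.
+ */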
+static char *
+fastpath_partition_init(char *ptr, int num_procs, int allprocs_index, int node,
+ Size fpLockBitsSize, Size fpRelIdSize)
+{
+ char *endptr = ptr + num_procs * (fpLockBitsSize + fpRelIdSize);
+
+ /*
+ * if node specified, move to node - do this before we start touching the
+ * memory, to make sure it's not mapped to any node yet
+ */
+ if (node != -1)
+ pg_numa_move_to_node(ptr, endptr, node);
+
+ /*
+ * Now point the PGPROC entries of this partition at their fast-path
+ * arrays, advancing the pointer as we go.
+ */
+ for (int i = 0; i < num_procs; i++)
+ {
+ PGPROC *proc = ProcGlobal->allProcs[allprocs_index];
+
+ /* cross-check we got the expected NUMA node */
+ Assert(proc->numa_node == node);
+ Assert(proc->procnumber == allprocs_index);
+
+ /*
+ * Set the fast-path lock arrays, and move the pointer. We interleave
+ * the two arrays, to (hopefully) get some locality for each backend.
+ */
+ proc->fpLockBits = (uint64 *) ptr;
+ ptr += fpLockBitsSize;
+
+ proc->fpRelId = (Oid *) ptr;
+ ptr += fpRelIdSize;
+
+ Assert(ptr <= endptr);
+
+ allprocs_index++;
+ }
+
+ Assert(ptr == endptr);
+
+ return endptr;
+}
+
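+/*
+ * ProcPartitionCount
+ *		Return the number of PGPROC partitions - one per NUMA node plus one
+ *		for auxiliary procs / prepared xacts, or 1 without partitioning.
+ */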
+int
+ProcPartitionCount(void)
+{
+ if (numa_procs_interleave && numa_can_partition)
+ return (numa_nodes + 1);
+
+ return 1;
+}
+
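+/*
+ * ProcPartitionGet
+ *		Return the NUMA node, number of PGPROC entries, and the PGPROC and
+ *		fast-path chunk pointers for the given partition index.
+ */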
+void
+ProcPartitionGet(int idx, int *node, int *nprocs, void **procsptr, void **fpptr)
+{
+ PGProcPartition *part = &partitions[idx];
+
+ Assert((idx >= 0) && (idx < ProcPartitionCount()));
+
+ *nprocs = part->num_procs;
+ *procsptr = part->pgproc_ptr;
+ *fpptr = part->fastpath_ptr;
+ *node = part->numa_node;
+}
diff --git a/src/backend/utils/init/globals.c b/src/backend/utils/init/globals.c
index a11bc71a386..6ee4684d1b8 100644
--- a/src/backend/utils/init/globals.c
+++ b/src/backend/utils/init/globals.c
@@ -149,6 +149,7 @@ int MaxBackends = 0;
bool numa_buffers_interleave = false;
bool numa_localalloc = false;
bool numa_partition_freelist = false;
+bool numa_procs_interleave = false;
/* GUC parameters for vacuum */
int VacuumBufferUsageLimit = 2048;
diff --git a/src/backend/utils/misc/guc_tables.c b/src/backend/utils/misc/guc_tables.c
index 0552ed62cc7..7b718760248 100644
--- a/src/backend/utils/misc/guc_tables.c
+++ b/src/backend/utils/misc/guc_tables.c
@@ -2146,6 +2146,16 @@ struct config_bool ConfigureNamesBool[] =
NULL, NULL, NULL
},
+ {
+ {"numa_procs_interleave", PGC_POSTMASTER, DEVELOPER_OPTIONS,
+ gettext_noop("Enables NUMA interleaving of PGPROC entries."),
+ gettext_noop("When enabled, the PGPROC entries are interleaved across all NUMA nodes."),
+ },
+ &numa_procs_interleave,
+ false,
+ NULL, NULL, NULL
+ },
+
{
{"sync_replication_slots", PGC_SIGHUP, REPLICATION_STANDBY,
gettext_noop("Enables a physical standby to synchronize logical failover replication slots from the primary server."),
diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h
index 66baf2bf33e..cdeee8dccba 100644
--- a/src/include/miscadmin.h
+++ b/src/include/miscadmin.h
@@ -181,6 +181,7 @@ extern PGDLLIMPORT int max_parallel_workers;
extern PGDLLIMPORT bool numa_buffers_interleave;
extern PGDLLIMPORT bool numa_localalloc;
extern PGDLLIMPORT bool numa_partition_freelist;
+extern PGDLLIMPORT bool numa_procs_interleave;
extern PGDLLIMPORT int commit_timestamp_buffers;
extern PGDLLIMPORT int multixact_member_buffers;
diff --git a/src/include/storage/proc.h b/src/include/storage/proc.h
index c6f5ebceefd..d2d269941fc 100644
--- a/src/include/storage/proc.h
+++ b/src/include/storage/proc.h
@@ -202,6 +202,8 @@ struct PGPROC
* vacuum must not remove tuples deleted by
* xid >= xmin ! */
+ int procnumber; /* index in ProcGlobal->allProcs */
+
int pid; /* Backend's process ID; 0 if prepared xact */
int pgxactoff; /* offset into various ProcGlobal->arrays with
@@ -327,6 +329,9 @@ struct PGPROC
PGPROC *lockGroupLeader; /* lock group leader, if I'm a member */
dlist_head lockGroupMembers; /* list of members, if I'm a leader */
dlist_node lockGroupLink; /* my member link, if I'm a member */
+
+ /* NUMA node this PGPROC entry is assigned to, or -1 if not assigned */
+ int numa_node;
};
/* NOTE: "typedef struct PGPROC PGPROC" appears in storage/lock.h. */
@@ -391,7 +396,7 @@ extern PGDLLIMPORT PGPROC *MyProc;
typedef struct PROC_HDR
{
/* Array of PGPROC structures (not including dummies for prepared txns) */
- PGPROC *allProcs;
+ PGPROC **allProcs;
/* Array mirroring PGPROC.xid for each PGPROC currently in the procarray */
TransactionId *xids;
@@ -443,8 +448,8 @@ extern PGDLLIMPORT PGPROC *PreparedXactProcs;
/*
* Accessors for getting PGPROC given a ProcNumber and vice versa.
*/
-#define GetPGProcByNumber(n) (&ProcGlobal->allProcs[(n)])
-#define GetNumberFromPGProc(proc) ((proc) - &ProcGlobal->allProcs[0])
+#define GetPGProcByNumber(n) (ProcGlobal->allProcs[(n)])
+#define GetNumberFromPGProc(proc) ((proc)->procnumber)
/*
* We set aside some extra PGPROC structures for "special worker" processes,
@@ -520,4 +525,7 @@ extern PGPROC *AuxiliaryPidGetProc(int pid);
extern void BecomeLockGroupLeader(void);
extern bool BecomeLockGroupMember(PGPROC *leader, int pid);
+extern int ProcPartitionCount(void);
+extern void ProcPartitionGet(int idx, int *node, int *nprocs, void **procsptr, void **fpptr);
+
#endif /* _PROC_H_ */
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index a38dd8d6242..5595cd48eee 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -1877,6 +1877,7 @@ PGP_MPI
PGP_PubKey
PGP_S2K
PGPing
+PGProcPartition
PGQueryClass
PGRUsage
PGSemaphore
--
2.50.1
v3-0005-NUMA-clockweep-partitioning.patch
From 2bfd4a824b12e9a865c5ef0a8ed33e215fb1b698 Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tomas@vondra.me>
Date: Sun, 8 Jun 2025 18:53:12 +0200
Subject: [PATCH v3 5/7] NUMA: clockweep partitioning
Similar to the freelist patch - partition the "clocksweep" algorithm to
work on a sequence of smaller partitions, one by one.
It extends the "pg_buffercache_partitions" view to include information
about the clocksweep activity.
Note: This needs some sort of "balancing" when one of the partitions is
much busier than the rest (e.g. because there's a single backend consuming
a lot of buffers from it).
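For illustration, the per-partition clocksweep counters should be visible
with a simple query against the extended view, something like:

  SELECT partition, numa_node, num_buffers,
         complete_passes, buffer_allocs, next_victim_buffer
    FROM pg_buffercache_partitions;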
---
.../pg_buffercache--1.6--1.7.sql | 5 +-
contrib/pg_buffercache/pg_buffercache_pages.c | 24 +-
src/backend/storage/buffer/bufmgr.c | 476 ++++++++++--------
src/backend/storage/buffer/freelist.c | 224 +++++++--
src/include/storage/buf_internals.h | 4 +-
src/include/storage/bufmgr.h | 5 +-
src/tools/pgindent/typedefs.list | 1 +
7 files changed, 478 insertions(+), 261 deletions(-)
diff --git a/contrib/pg_buffercache/pg_buffercache--1.6--1.7.sql b/contrib/pg_buffercache/pg_buffercache--1.6--1.7.sql
index 3871c261528..b7d8ea45ed7 100644
--- a/contrib/pg_buffercache/pg_buffercache--1.6--1.7.sql
+++ b/contrib/pg_buffercache/pg_buffercache--1.6--1.7.sql
@@ -12,7 +12,10 @@ LANGUAGE C PARALLEL SAFE;
-- Create a view for convenient access.
CREATE VIEW pg_buffercache_partitions AS
SELECT P.* FROM pg_buffercache_partitions() AS P
- (partition integer, numa_node integer, num_buffers integer, first_buffer integer, last_buffer integer, buffers_consumed bigint, buffers_remain bigint, buffers_free bigint);
+ (partition integer,
+ numa_node integer, num_buffers integer, first_buffer integer, last_buffer integer,
+ buffers_consumed bigint, buffers_remain bigint, buffers_free bigint,
+ complete_passes bigint, buffer_allocs bigint, next_victim_buffer integer);
-- Don't want these to be available to public.
REVOKE ALL ON FUNCTION pg_buffercache_partitions() FROM PUBLIC;
diff --git a/contrib/pg_buffercache/pg_buffercache_pages.c b/contrib/pg_buffercache/pg_buffercache_pages.c
index 668ada8c47b..5169655ae78 100644
--- a/contrib/pg_buffercache/pg_buffercache_pages.c
+++ b/contrib/pg_buffercache/pg_buffercache_pages.c
@@ -27,7 +27,7 @@
#define NUM_BUFFERCACHE_EVICT_ALL_ELEM 3
#define NUM_BUFFERCACHE_NUMA_ELEM 3
-#define NUM_BUFFERCACHE_PARTITIONS_ELEM 8
+#define NUM_BUFFERCACHE_PARTITIONS_ELEM 11
PG_MODULE_MAGIC_EXT(
.name = "pg_buffercache",
@@ -818,6 +818,12 @@ pg_buffercache_partitions(PG_FUNCTION_ARGS)
INT8OID, -1, 0);
TupleDescInitEntry(tupledesc, (AttrNumber) 8, "buffers_free",
INT8OID, -1, 0);
+ TupleDescInitEntry(tupledesc, (AttrNumber) 9, "complete_passes",
+ INT8OID, -1, 0);
+ TupleDescInitEntry(tupledesc, (AttrNumber) 10, "buffer_allocs",
+ INT8OID, -1, 0);
+ TupleDescInitEntry(tupledesc, (AttrNumber) 11, "next_victim_buffer",
+ INT4OID, -1, 0);
funcctx->user_fctx = BlessTupleDesc(tupledesc);
@@ -843,6 +849,10 @@ pg_buffercache_partitions(PG_FUNCTION_ARGS)
buffers_remain,
buffers_free;
+ uint32 complete_passes,
+ buffer_allocs,
+ next_victim_buffer;
+
Datum values[NUM_BUFFERCACHE_PARTITIONS_ELEM];
bool nulls[NUM_BUFFERCACHE_PARTITIONS_ELEM];
@@ -850,7 +860,8 @@ pg_buffercache_partitions(PG_FUNCTION_ARGS)
&first_buffer, &last_buffer);
FreelistPartitionGetInfo(i, &buffers_consumed, &buffers_remain,
- &buffers_free);
+ &buffers_free, &complete_passes,
+ &buffer_allocs, &next_victim_buffer);
values[0] = Int32GetDatum(i);
nulls[0] = false;
@@ -876,6 +887,15 @@ pg_buffercache_partitions(PG_FUNCTION_ARGS)
values[7] = Int64GetDatum(buffers_free);
nulls[7] = false;
+ values[8] = Int64GetDatum(complete_passes);
+ nulls[8] = false;
+
+ values[9] = Int64GetDatum(buffer_allocs);
+ nulls[9] = false;
+
+ values[10] = Int32GetDatum(next_victim_buffer);
+ nulls[10] = false;
+
/* Build and return the tuple. */
tuple = heap_form_tuple((TupleDesc) funcctx->user_fctx, values, nulls);
result = HeapTupleGetDatum(tuple);
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 5922689fe5d..bd007c1c621 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -3587,6 +3587,23 @@ BufferSync(int flags)
TRACE_POSTGRESQL_BUFFER_SYNC_DONE(NBuffers, num_written, num_to_scan);
}
+/*
+ * Information saved between calls so we can determine the strategy
+ * point's advance rate and avoid scanning already-cleaned buffers.
+ *
+ * XXX One value per partition. We don't know how many partitions there
+ * are, so allocate 32 - that should be enough for the PoC patch.
+ *
+ * XXX might be better to have a per-partition struct with all the info
+ */
+#define MAX_CLOCKSWEEP_PARTITIONS 32
+static bool saved_info_valid = false;
+static int prev_strategy_buf_id[MAX_CLOCKSWEEP_PARTITIONS];
+static uint32 prev_strategy_passes[MAX_CLOCKSWEEP_PARTITIONS];
+static int next_to_clean[MAX_CLOCKSWEEP_PARTITIONS];
+static uint32 next_passes[MAX_CLOCKSWEEP_PARTITIONS];
+
+
/*
* BgBufferSync -- Write out some dirty buffers in the pool.
*
@@ -3602,55 +3619,24 @@ bool
BgBufferSync(WritebackContext *wb_context)
{
/* info obtained from freelist.c */
- int strategy_buf_id;
- uint32 strategy_passes;
uint32 recent_alloc;
+ uint32 recent_alloc_partition;
+ int num_partitions;
- /*
- * Information saved between calls so we can determine the strategy
- * point's advance rate and avoid scanning already-cleaned buffers.
- */
- static bool saved_info_valid = false;
- static int prev_strategy_buf_id;
- static uint32 prev_strategy_passes;
- static int next_to_clean;
- static uint32 next_passes;
-
- /* Moving averages of allocation rate and clean-buffer density */
- static float smoothed_alloc = 0;
- static float smoothed_density = 10.0;
-
- /* Potentially these could be tunables, but for now, not */
- float smoothing_samples = 16;
- float scan_whole_pool_milliseconds = 120000.0;
-
- /* Used to compute how far we scan ahead */
- long strategy_delta;
- int bufs_to_lap;
- int bufs_ahead;
- float scans_per_alloc;
- int reusable_buffers_est;
- int upcoming_alloc_est;
- int min_scan_buffers;
-
- /* Variables for the scanning loop proper */
- int num_to_scan;
- int num_written;
- int reusable_buffers;
+ /* assume we can hibernate; any partition can set this to false */
+ bool hibernate = true;
- /* Variables for final smoothed_density update */
- long new_strategy_delta;
- uint32 new_recent_alloc;
+ /* get the number of clocksweep partitions, and total alloc count */
+ StrategySyncPrepare(&num_partitions, &recent_alloc);
- /*
- * Find out where the freelist clock sweep currently is, and how many
- * buffer allocations have happened since our last call.
- */
- strategy_buf_id = StrategySyncStart(&strategy_passes, &recent_alloc);
+ Assert(num_partitions <= MAX_CLOCKSWEEP_PARTITIONS);
/* Report buffer alloc counts to pgstat */
PendingBgWriterStats.buf_alloc += recent_alloc;
+ /* average alloc buffers per partition */
+ recent_alloc_partition = (recent_alloc / num_partitions);
+
/*
* If we're not running the LRU scan, just stop after doing the stats
* stuff. We mark the saved state invalid so that we can recover sanely
@@ -3663,223 +3649,285 @@ BgBufferSync(WritebackContext *wb_context)
}
/*
- * Compute strategy_delta = how many buffers have been scanned by the
- * clock sweep since last time. If first time through, assume none. Then
- * see if we are still ahead of the clock sweep, and if so, how many
- * buffers we could scan before we'd catch up with it and "lap" it. Note:
- * weird-looking coding of xxx_passes comparisons are to avoid bogus
- * behavior when the passes counts wrap around.
- */
- if (saved_info_valid)
- {
- int32 passes_delta = strategy_passes - prev_strategy_passes;
-
- strategy_delta = strategy_buf_id - prev_strategy_buf_id;
- strategy_delta += (long) passes_delta * NBuffers;
+ * now process the clocksweep partitions, one by one, using the same
+ * cleanup that we used for all buffers
+ *
+ * XXX Maybe we should randomize the order of partitions a bit, so that we
+ * don't start from partition 0 all the time? Perhaps not entirely, but at
+ * least pick a random starting point?
+ */
+ for (int partition = 0; partition < num_partitions; partition++)
+ {
+ /* info obtained from freelist.c */
+ int strategy_buf_id;
+ uint32 strategy_passes;
+
+ /* Moving averages of allocation rate and clean-buffer density */
+ static float smoothed_alloc = 0;
+ static float smoothed_density = 10.0;
+
+ /* Potentially these could be tunables, but for now, not */
+ float smoothing_samples = 16;
+ float scan_whole_pool_milliseconds = 120000.0;
+
+ /* Used to compute how far we scan ahead */
+ long strategy_delta;
+ int bufs_to_lap;
+ int bufs_ahead;
+ float scans_per_alloc;
+ int reusable_buffers_est;
+ int upcoming_alloc_est;
+ int min_scan_buffers;
+
+ /* Variables for the scanning loop proper */
+ int num_to_scan;
+ int num_written;
+ int reusable_buffers;
+
+ /* Variables for final smoothed_density update */
+ long new_strategy_delta;
+ uint32 new_recent_alloc;
+
+ /* buffer range for the clocksweep partition */
+ int first_buffer;
+ int num_buffers;
- Assert(strategy_delta >= 0);
+ /*
+ * Find out where the freelist clock sweep currently is, and how many
+ * buffer allocations have happened since our last call.
+ */
+ strategy_buf_id = StrategySyncStart(partition, &strategy_passes,
+ &first_buffer, &num_buffers);
- if ((int32) (next_passes - strategy_passes) > 0)
+ /*
+ * Compute strategy_delta = how many buffers have been scanned by the
+ * clock sweep since last time. If first time through, assume none.
+ * Then see if we are still ahead of the clock sweep, and if so, how
+ * many buffers we could scan before we'd catch up with it and "lap"
+ * it. Note: weird-looking coding of xxx_passes comparisons are to
+ * avoid bogus behavior when the passes counts wrap around.
+ */
+ if (saved_info_valid)
{
- /* we're one pass ahead of the strategy point */
- bufs_to_lap = strategy_buf_id - next_to_clean;
+ int32 passes_delta = strategy_passes - prev_strategy_passes[partition];
+
+ strategy_delta = strategy_buf_id - prev_strategy_buf_id[partition];
+ strategy_delta += (long) passes_delta * num_buffers;
+
+ Assert(strategy_delta >= 0);
+
+ if ((int32) (next_passes[partition] - strategy_passes) > 0)
+ {
+ /* we're one pass ahead of the strategy point */
+ bufs_to_lap = strategy_buf_id - next_to_clean[partition];
#ifdef BGW_DEBUG
- elog(DEBUG2, "bgwriter ahead: bgw %u-%u strategy %u-%u delta=%ld lap=%d",
- next_passes, next_to_clean,
- strategy_passes, strategy_buf_id,
- strategy_delta, bufs_to_lap);
+ elog(DEBUG2, "bgwriter ahead: bgw %u-%u strategy %u-%u delta=%ld lap=%d",
+ next_passes[partition], next_to_clean[partition],
+ strategy_passes, strategy_buf_id,
+ strategy_delta, bufs_to_lap);
#endif
- }
- else if (next_passes == strategy_passes &&
- next_to_clean >= strategy_buf_id)
- {
- /* on same pass, but ahead or at least not behind */
- bufs_to_lap = NBuffers - (next_to_clean - strategy_buf_id);
+ }
+ else if (next_passes[partition] == strategy_passes &&
+ next_to_clean[partition] >= strategy_buf_id)
+ {
+ /* on same pass, but ahead or at least not behind */
+ bufs_to_lap = num_buffers - (next_to_clean[partition] - strategy_buf_id);
+#ifdef BGW_DEBUG
+ elog(DEBUG2, "bgwriter ahead: bgw %u-%u strategy %u-%u delta=%ld lap=%d",
+ next_passes[partition], next_to_clean[partition],
+ strategy_passes, strategy_buf_id,
+ strategy_delta, bufs_to_lap);
+#endif
+ }
+ else
+ {
+ /*
+ * We're behind, so skip forward to the strategy point and
+ * start cleaning from there.
+ */
#ifdef BGW_DEBUG
- elog(DEBUG2, "bgwriter ahead: bgw %u-%u strategy %u-%u delta=%ld lap=%d",
- next_passes, next_to_clean,
- strategy_passes, strategy_buf_id,
- strategy_delta, bufs_to_lap);
+ elog(DEBUG2, "bgwriter behind: bgw %u-%u strategy %u-%u delta=%ld",
+ next_passes[partition], next_to_clean[partition],
+ strategy_passes, strategy_buf_id,
+ strategy_delta);
#endif
+ next_to_clean[partition] = strategy_buf_id;
+ next_passes[partition] = strategy_passes;
+ bufs_to_lap = num_buffers;
+ }
}
else
{
/*
- * We're behind, so skip forward to the strategy point and start
- * cleaning from there.
+ * Initializing at startup or after LRU scanning had been off.
+ * Always start at the strategy point.
*/
#ifdef BGW_DEBUG
- elog(DEBUG2, "bgwriter behind: bgw %u-%u strategy %u-%u delta=%ld",
- next_passes, next_to_clean,
- strategy_passes, strategy_buf_id,
- strategy_delta);
+ elog(DEBUG2, "bgwriter initializing: strategy %u-%u",
+ strategy_passes, strategy_buf_id);
#endif
- next_to_clean = strategy_buf_id;
- next_passes = strategy_passes;
- bufs_to_lap = NBuffers;
+ strategy_delta = 0;
+ next_to_clean[partition] = strategy_buf_id;
+ next_passes[partition] = strategy_passes;
+ bufs_to_lap = num_buffers;
}
- }
- else
- {
- /*
- * Initializing at startup or after LRU scanning had been off. Always
- * start at the strategy point.
- */
-#ifdef BGW_DEBUG
- elog(DEBUG2, "bgwriter initializing: strategy %u-%u",
- strategy_passes, strategy_buf_id);
-#endif
- strategy_delta = 0;
- next_to_clean = strategy_buf_id;
- next_passes = strategy_passes;
- bufs_to_lap = NBuffers;
- }
- /* Update saved info for next time */
- prev_strategy_buf_id = strategy_buf_id;
- prev_strategy_passes = strategy_passes;
- saved_info_valid = true;
+ /* Update saved info for next time */
+ prev_strategy_buf_id[partition] = strategy_buf_id;
+ prev_strategy_passes[partition] = strategy_passes;
+ /* FIXME has to happen after all partitions */
+ /* saved_info_valid = true; */
- /*
- * Compute how many buffers had to be scanned for each new allocation, ie,
- * 1/density of reusable buffers, and track a moving average of that.
- *
- * If the strategy point didn't move, we don't update the density estimate
- */
- if (strategy_delta > 0 && recent_alloc > 0)
- {
- scans_per_alloc = (float) strategy_delta / (float) recent_alloc;
- smoothed_density += (scans_per_alloc - smoothed_density) /
- smoothing_samples;
- }
+ /*
+ * Compute how many buffers had to be scanned for each new allocation,
+ * ie, 1/density of reusable buffers, and track a moving average of
+ * that.
+ *
+ * If the strategy point didn't move, we don't update the density
+ * estimate
+ */
+ if (strategy_delta > 0 && recent_alloc_partition > 0)
+ {
+ scans_per_alloc = (float) strategy_delta / (float) recent_alloc_partition;
+ smoothed_density += (scans_per_alloc - smoothed_density) /
+ smoothing_samples;
+ }
- /*
- * Estimate how many reusable buffers there are between the current
- * strategy point and where we've scanned ahead to, based on the smoothed
- * density estimate.
- */
- bufs_ahead = NBuffers - bufs_to_lap;
- reusable_buffers_est = (float) bufs_ahead / smoothed_density;
+ /*
+ * Estimate how many reusable buffers there are between the current
+ * strategy point and where we've scanned ahead to, based on the
+ * smoothed density estimate.
+ */
+ bufs_ahead = num_buffers - bufs_to_lap;
+ reusable_buffers_est = (float) bufs_ahead / smoothed_density;
- /*
- * Track a moving average of recent buffer allocations. Here, rather than
- * a true average we want a fast-attack, slow-decline behavior: we
- * immediately follow any increase.
- */
- if (smoothed_alloc <= (float) recent_alloc)
- smoothed_alloc = recent_alloc;
- else
- smoothed_alloc += ((float) recent_alloc - smoothed_alloc) /
- smoothing_samples;
+ /*
+ * Track a moving average of recent buffer allocations. Here, rather
+ * than a true average we want a fast-attack, slow-decline behavior:
+ * we immediately follow any increase.
+ */
+ if (smoothed_alloc <= (float) recent_alloc_partition)
+ smoothed_alloc = recent_alloc_partition;
+ else
+ smoothed_alloc += ((float) recent_alloc_partition - smoothed_alloc) /
+ smoothing_samples;
- /* Scale the estimate by a GUC to allow more aggressive tuning. */
- upcoming_alloc_est = (int) (smoothed_alloc * bgwriter_lru_multiplier);
+ /* Scale the estimate by a GUC to allow more aggressive tuning. */
+ upcoming_alloc_est = (int) (smoothed_alloc * bgwriter_lru_multiplier);
- /*
- * If recent_alloc remains at zero for many cycles, smoothed_alloc will
- * eventually underflow to zero, and the underflows produce annoying
- * kernel warnings on some platforms. Once upcoming_alloc_est has gone to
- * zero, there's no point in tracking smaller and smaller values of
- * smoothed_alloc, so just reset it to exactly zero to avoid this
- * syndrome. It will pop back up as soon as recent_alloc increases.
- */
- if (upcoming_alloc_est == 0)
- smoothed_alloc = 0;
+ /*
+ * If recent_alloc remains at zero for many cycles, smoothed_alloc
+ * will eventually underflow to zero, and the underflows produce
+ * annoying kernel warnings on some platforms. Once
+ * upcoming_alloc_est has gone to zero, there's no point in tracking
+ * smaller and smaller values of smoothed_alloc, so just reset it to
+ * exactly zero to avoid this syndrome. It will pop back up as soon
+ * as recent_alloc increases.
+ */
+ if (upcoming_alloc_est == 0)
+ smoothed_alloc = 0;
- /*
- * Even in cases where there's been little or no buffer allocation
- * activity, we want to make a small amount of progress through the buffer
- * cache so that as many reusable buffers as possible are clean after an
- * idle period.
- *
- * (scan_whole_pool_milliseconds / BgWriterDelay) computes how many times
- * the BGW will be called during the scan_whole_pool time; slice the
- * buffer pool into that many sections.
- */
- min_scan_buffers = (int) (NBuffers / (scan_whole_pool_milliseconds / BgWriterDelay));
+ /*
+ * Even in cases where there's been little or no buffer allocation
+ * activity, we want to make a small amount of progress through the
+ * buffer cache so that as many reusable buffers as possible are clean
+ * after an idle period.
+ *
+ * (scan_whole_pool_milliseconds / BgWriterDelay) computes how many
+ * times the BGW will be called during the scan_whole_pool time; slice
+ * the buffer pool into that many sections.
+ */
+ min_scan_buffers = (int) (num_buffers / (scan_whole_pool_milliseconds / BgWriterDelay));
- if (upcoming_alloc_est < (min_scan_buffers + reusable_buffers_est))
- {
+ if (upcoming_alloc_est < (min_scan_buffers + reusable_buffers_est))
+ {
#ifdef BGW_DEBUG
- elog(DEBUG2, "bgwriter: alloc_est=%d too small, using min=%d + reusable_est=%d",
- upcoming_alloc_est, min_scan_buffers, reusable_buffers_est);
+ elog(DEBUG2, "bgwriter: alloc_est=%d too small, using min=%d + reusable_est=%d",
+ upcoming_alloc_est, min_scan_buffers, reusable_buffers_est);
#endif
- upcoming_alloc_est = min_scan_buffers + reusable_buffers_est;
- }
-
- /*
- * Now write out dirty reusable buffers, working forward from the
- * next_to_clean point, until we have lapped the strategy scan, or cleaned
- * enough buffers to match our estimate of the next cycle's allocation
- * requirements, or hit the bgwriter_lru_maxpages limit.
- */
+ upcoming_alloc_est = min_scan_buffers + reusable_buffers_est;
+ }
- num_to_scan = bufs_to_lap;
- num_written = 0;
- reusable_buffers = reusable_buffers_est;
+ /*
+ * Now write out dirty reusable buffers, working forward from the
+ * next_to_clean point, until we have lapped the strategy scan, or
+ * cleaned enough buffers to match our estimate of the next cycle's
+ * allocation requirements, or hit the bgwriter_lru_maxpages limit.
+ */
- /* Execute the LRU scan */
- while (num_to_scan > 0 && reusable_buffers < upcoming_alloc_est)
- {
- int sync_state = SyncOneBuffer(next_to_clean, true,
- wb_context);
+ num_to_scan = bufs_to_lap;
+ num_written = 0;
+ reusable_buffers = reusable_buffers_est;
- if (++next_to_clean >= NBuffers)
+ /* Execute the LRU scan */
+ while (num_to_scan > 0 && reusable_buffers < upcoming_alloc_est)
{
- next_to_clean = 0;
- next_passes++;
- }
- num_to_scan--;
+ int sync_state = SyncOneBuffer(next_to_clean[partition], true,
+ wb_context);
- if (sync_state & BUF_WRITTEN)
- {
- reusable_buffers++;
- if (++num_written >= bgwriter_lru_maxpages)
+ if (++next_to_clean[partition] >= (first_buffer + num_buffers))
{
- PendingBgWriterStats.maxwritten_clean++;
- break;
+ next_to_clean[partition] = first_buffer;
+ next_passes[partition]++;
+ }
+ num_to_scan--;
+
+ if (sync_state & BUF_WRITTEN)
+ {
+ reusable_buffers++;
+ if (++num_written >= (bgwriter_lru_maxpages / num_partitions))
+ {
+ PendingBgWriterStats.maxwritten_clean++;
+ break;
+ }
}
+ else if (sync_state & BUF_REUSABLE)
+ reusable_buffers++;
}
- else if (sync_state & BUF_REUSABLE)
- reusable_buffers++;
- }
- PendingBgWriterStats.buf_written_clean += num_written;
+ PendingBgWriterStats.buf_written_clean += num_written;
#ifdef BGW_DEBUG
- elog(DEBUG1, "bgwriter: recent_alloc=%u smoothed=%.2f delta=%ld ahead=%d density=%.2f reusable_est=%d upcoming_est=%d scanned=%d wrote=%d reusable=%d",
- recent_alloc, smoothed_alloc, strategy_delta, bufs_ahead,
- smoothed_density, reusable_buffers_est, upcoming_alloc_est,
- bufs_to_lap - num_to_scan,
- num_written,
- reusable_buffers - reusable_buffers_est);
+ elog(DEBUG1, "bgwriter: recent_alloc=%u smoothed=%.2f delta=%ld ahead=%d density=%.2f reusable_est=%d upcoming_est=%d scanned=%d wrote=%d reusable=%d",
+ recent_alloc_partition, smoothed_alloc, strategy_delta, bufs_ahead,
+ smoothed_density, reusable_buffers_est, upcoming_alloc_est,
+ bufs_to_lap - num_to_scan,
+ num_written,
+ reusable_buffers - reusable_buffers_est);
#endif
- /*
- * Consider the above scan as being like a new allocation scan.
- * Characterize its density and update the smoothed one based on it. This
- * effectively halves the moving average period in cases where both the
- * strategy and the background writer are doing some useful scanning,
- * which is helpful because a long memory isn't as desirable on the
- * density estimates.
- */
- new_strategy_delta = bufs_to_lap - num_to_scan;
- new_recent_alloc = reusable_buffers - reusable_buffers_est;
- if (new_strategy_delta > 0 && new_recent_alloc > 0)
- {
- scans_per_alloc = (float) new_strategy_delta / (float) new_recent_alloc;
- smoothed_density += (scans_per_alloc - smoothed_density) /
- smoothing_samples;
+ /*
+ * Consider the above scan as being like a new allocation scan.
+ * Characterize its density and update the smoothed one based on it.
+ * This effectively halves the moving average period in cases where
+ * both the strategy and the background writer are doing some useful
+ * scanning, which is helpful because a long memory isn't as desirable
+ * on the density estimates.
+ */
+ new_strategy_delta = bufs_to_lap - num_to_scan;
+ new_recent_alloc = reusable_buffers - reusable_buffers_est;
+ if (new_strategy_delta > 0 && new_recent_alloc > 0)
+ {
+ scans_per_alloc = (float) new_strategy_delta / (float) new_recent_alloc;
+ smoothed_density += (scans_per_alloc - smoothed_density) /
+ smoothing_samples;
#ifdef BGW_DEBUG
- elog(DEBUG2, "bgwriter: cleaner density alloc=%u scan=%ld density=%.2f new smoothed=%.2f",
- new_recent_alloc, new_strategy_delta,
- scans_per_alloc, smoothed_density);
+ elog(DEBUG2, "bgwriter: cleaner density alloc=%u scan=%ld density=%.2f new smoothed=%.2f",
+ new_recent_alloc, new_strategy_delta,
+ scans_per_alloc, smoothed_density);
#endif
+ }
+
+ /* hibernate if all partitions can hibernate */
+ hibernate &= (bufs_to_lap == 0 && recent_alloc_partition == 0);
}
+ /* now that we've scanned all partitions, mark the cached info as valid */
+ saved_info_valid = true;
+
/* Return true if OK to hibernate */
- return (bufs_to_lap == 0 && recent_alloc == 0);
+ return hibernate;
}
/*
diff --git a/src/backend/storage/buffer/freelist.c b/src/backend/storage/buffer/freelist.c
index c3fbd651dd5..ff02dc8e00b 100644
--- a/src/backend/storage/buffer/freelist.c
+++ b/src/backend/storage/buffer/freelist.c
@@ -63,17 +63,27 @@ typedef struct BufferStrategyFreelist
#define MIN_FREELIST_PARTITIONS 4
/*
- * The shared freelist control information.
+ * Information about one partition of the ClockSweep (on a subset of buffers).
+ *
+ * XXX Should be careful to align this to cachelines, etc.
*/
typedef struct
{
/* Spinlock: protects the values below */
- slock_t buffer_strategy_lock;
+ slock_t clock_sweep_lock;
+
+	/* range for this clock sweep partition */
+ int32 firstBuffer;
+ int32 numBuffers;
/*
* Clock sweep hand: index of next buffer to consider grabbing. Note that
* this isn't a concrete buffer - we only ever increase the value. So, to
* get an actual buffer, it needs to be used modulo NBuffers.
+ *
+ * XXX This is relative to firstBuffer, so needs to be offset properly.
+ *
+ * XXX firstBuffer + (nextVictimBuffer % numBuffers)
*/
pg_atomic_uint32 nextVictimBuffer;
@@ -83,6 +93,15 @@ typedef struct
*/
uint32 completePasses; /* Complete cycles of the clock sweep */
pg_atomic_uint32 numBufferAllocs; /* Buffers allocated since last reset */
+} ClockSweep;
+
+/*
+ * The shared freelist control information.
+ */
+typedef struct
+{
+ /* Spinlock: protects the values below */
+ slock_t buffer_strategy_lock;
/*
* Bgworker process to be notified upon activity or -1 if none. See
@@ -99,6 +118,9 @@ typedef struct
int num_partitions;
int num_partitions_per_node;
+ /* clocksweep partitions */
+ ClockSweep *sweeps;
+
BufferStrategyFreelist freelists[FLEXIBLE_ARRAY_MEMBER];
} BufferStrategyControl;
@@ -138,6 +160,7 @@ static BufferDesc *GetBufferFromRing(BufferAccessStrategy strategy,
uint32 *buf_state);
static void AddBufferToRing(BufferAccessStrategy strategy,
BufferDesc *buf);
+static ClockSweep *ChooseClockSweep(void);
/*
* ClockSweepTick - Helper routine for StrategyGetBuffer()
@@ -149,6 +172,7 @@ static inline uint32
ClockSweepTick(void)
{
uint32 victim;
+ ClockSweep *sweep = ChooseClockSweep();
/*
* Atomically move hand ahead one buffer - if there's several processes
@@ -156,14 +180,14 @@ ClockSweepTick(void)
* apparent order.
*/
victim =
- pg_atomic_fetch_add_u32(&StrategyControl->nextVictimBuffer, 1);
+ pg_atomic_fetch_add_u32(&sweep->nextVictimBuffer, 1);
- if (victim >= NBuffers)
+ if (victim >= sweep->numBuffers)
{
uint32 originalVictim = victim;
/* always wrap what we look up in BufferDescriptors */
- victim = victim % NBuffers;
+ victim = victim % sweep->numBuffers;
/*
* If we're the one that just caused a wraparound, force
@@ -189,19 +213,23 @@ ClockSweepTick(void)
* could lead to an overflow of nextVictimBuffers, but that's
* highly unlikely and wouldn't be particularly harmful.
*/
- SpinLockAcquire(&StrategyControl->buffer_strategy_lock);
+ SpinLockAcquire(&sweep->clock_sweep_lock);
- wrapped = expected % NBuffers;
+ wrapped = expected % sweep->numBuffers;
- success = pg_atomic_compare_exchange_u32(&StrategyControl->nextVictimBuffer,
+ success = pg_atomic_compare_exchange_u32(&sweep->nextVictimBuffer,
&expected, wrapped);
if (success)
- StrategyControl->completePasses++;
- SpinLockRelease(&StrategyControl->buffer_strategy_lock);
+ sweep->completePasses++;
+ SpinLockRelease(&sweep->clock_sweep_lock);
}
}
}
- return victim;
+
+ /* XXX buffer IDs are 1-based, we're calculating 0-based indexes */
+ Assert(BufferIsValid(1 + sweep->firstBuffer + (victim % sweep->numBuffers)));
+
+ return sweep->firstBuffer + victim;
}
static int
@@ -258,6 +286,28 @@ calculate_partition_index()
return index;
}
+/*
+ * ChooseClockSweep
+ * pick a clocksweep partition based on NUMA node and CPU
+ *
+ * The number of clocksweep partitions may not match the number of NUMA
+ * nodes, but it should not be lower. Each partition should be mapped to
+ * a single NUMA node, but a node may have multiple partitions. If there
+ * are multiple partitions per node (all nodes have the same number of
+ * partitions), we pick the partition using CPU.
+ *
+ * XXX Maybe we should make both the total and "per group" counts a power of
+ * two? That'd allow using shifts instead of divisions in the calculation,
+ * and that's cheaper. But how would that deal with an odd number of nodes?
+ */
+static ClockSweep *
+ChooseClockSweep(void)
+{
+ int index = calculate_partition_index();
+
+ return &StrategyControl->sweeps[index];
+}
+
/*
* ChooseFreeList
* Pick the buffer freelist to use, depending on the CPU and NUMA node.
@@ -374,7 +424,7 @@ StrategyGetBuffer(BufferAccessStrategy strategy, uint32 *buf_state, bool *from_r
* the rate of buffer consumption. Note that buffers recycled by a
* strategy object are intentionally not counted here.
*/
- pg_atomic_fetch_add_u32(&StrategyControl->numBufferAllocs, 1);
+ pg_atomic_fetch_add_u32(&ChooseClockSweep()->numBufferAllocs, 1);
/*
* First check, without acquiring the lock, whether there's buffers in the
@@ -445,13 +495,17 @@ StrategyGetBuffer(BufferAccessStrategy strategy, uint32 *buf_state, bool *from_r
/*
* Nothing on the freelist, so run the "clock sweep" algorithm
*
- * XXX Should we also make this NUMA-aware, to only access buffers from
- * the same NUMA node? That'd probably mean we need to make the clock
- * sweep NUMA-aware, perhaps by having multiple clock sweeps, each for a
- * subset of buffers. But that also means each process could "sweep" only
- * a fraction of buffers, even if the other buffers are better candidates
- * for eviction. Would that also mean we'd have multiple bgwriters, one
- * for each node, or would one bgwriter handle all of that?
+ * XXX Note that ClockSweepTick() is NUMA-aware, i.e. it only looks at
+ * buffers from a single partition, aligned with the NUMA node. That means
+ * it only accesses buffers from the same NUMA node.
+ *
+ * XXX That also means each process "sweeps" only a fraction of buffers,
+ * even if the other buffers are better candidates for eviction. Maybe
+ * there should be some logic to "steal" buffers from other freelists or
+ * other nodes?
+ *
+ * XXX Would that also mean we'd have multiple bgwriters, one for each
+ * node, or would one bgwriter handle all of that?
*/
trycounter = NBuffers;
for (;;)
@@ -533,6 +587,41 @@ StrategyFreeBuffer(BufferDesc *buf)
SpinLockRelease(&freelist->freelist_lock);
}
+/*
+ * StrategySyncPrepare -- prepare for sync of all partitions
+ *
+ * Returns the number of clocksweep partitions, and the count of recent
+ * buffer allocs summed over all the partitions. This allows BgBufferSync
+ * to calculate the average number of allocations per partition for the
+ * next sync cycle.
+ *
+ * The per-partition alloc counts are reset after being read, as the
+ * partitions are walked.
+ */
+void
+StrategySyncPrepare(int *num_parts, uint32 *num_buf_alloc)
+{
+ *num_buf_alloc = 0;
+ *num_parts = StrategyControl->num_partitions;
+
+ /*
+	 * We lock the partitions one by one, so not exactly in sync, but that
+ * should be fine. We're only looking for heuristics anyway.
+ */
+ for (int i = 0; i < StrategyControl->num_partitions; i++)
+ {
+ ClockSweep *sweep = &StrategyControl->sweeps[i];
+
+ SpinLockAcquire(&sweep->clock_sweep_lock);
+ if (num_buf_alloc)
+ {
+ *num_buf_alloc += pg_atomic_exchange_u32(&sweep->numBufferAllocs, 0);
+ }
+ SpinLockRelease(&sweep->clock_sweep_lock);
+ }
+}
+
/*
* StrategySyncStart -- tell BgBufferSync where to start syncing
*
@@ -540,37 +629,44 @@ StrategyFreeBuffer(BufferDesc *buf)
* BgBufferSync() will proceed circularly around the buffer array from there.
*
* In addition, we return the completed-pass count (which is effectively
- * the higher-order bits of nextVictimBuffer) and the count of recent buffer
- * allocs if non-NULL pointers are passed. The alloc count is reset after
- * being read.
+ * the higher-order bits of nextVictimBuffer).
+ *
+ * This only considers a single clocksweep partition, as BgBufferSync looks
+ * at them one by one.
*/
int
-StrategySyncStart(uint32 *complete_passes, uint32 *num_buf_alloc)
+StrategySyncStart(int partition, uint32 *complete_passes,
+ int *first_buffer, int *num_buffers)
{
uint32 nextVictimBuffer;
int result;
+ ClockSweep *sweep = &StrategyControl->sweeps[partition];
- SpinLockAcquire(&StrategyControl->buffer_strategy_lock);
- nextVictimBuffer = pg_atomic_read_u32(&StrategyControl->nextVictimBuffer);
- result = nextVictimBuffer % NBuffers;
+ Assert((partition >= 0) && (partition < StrategyControl->num_partitions));
+
+ SpinLockAcquire(&sweep->clock_sweep_lock);
+ nextVictimBuffer = pg_atomic_read_u32(&sweep->nextVictimBuffer);
+ result = nextVictimBuffer % sweep->numBuffers;
+
+ *first_buffer = sweep->firstBuffer;
+ *num_buffers = sweep->numBuffers;
if (complete_passes)
{
- *complete_passes = StrategyControl->completePasses;
+ *complete_passes = sweep->completePasses;
/*
* Additionally add the number of wraparounds that happened before
* completePasses could be incremented. C.f. ClockSweepTick().
*/
- *complete_passes += nextVictimBuffer / NBuffers;
+ *complete_passes += nextVictimBuffer / sweep->numBuffers;
}
+ SpinLockRelease(&sweep->clock_sweep_lock);
- if (num_buf_alloc)
- {
- *num_buf_alloc = pg_atomic_exchange_u32(&StrategyControl->numBufferAllocs, 0);
- }
- SpinLockRelease(&StrategyControl->buffer_strategy_lock);
- return result;
+ /* XXX buffer IDs start at 1, we're calculating 0-based indexes */
+ Assert(BufferIsValid(1 + sweep->firstBuffer + result));
+
+ return sweep->firstBuffer + result;
}
/*
@@ -658,6 +754,10 @@ StrategyShmemSize(void)
size = add_size(size, MAXALIGN(mul_size(sizeof(BufferStrategyFreelist),
num_partitions)));
+ /* size of clocksweep partitions (at least one per NUMA node) */
+ size = add_size(size, MAXALIGN(mul_size(sizeof(ClockSweep),
+ num_partitions)));
+
return size;
}
@@ -676,6 +776,7 @@ StrategyInitialize(bool init)
int num_nodes;
int num_partitions;
int num_partitions_per_node;
+ char *ptr;
/* */
BufferPartitionParams(&num_partitions, &num_nodes);
@@ -703,7 +804,8 @@ StrategyInitialize(bool init)
StrategyControl = (BufferStrategyControl *)
ShmemInitStruct("Buffer Strategy Status",
MAXALIGN(offsetof(BufferStrategyControl, freelists)) +
- MAXALIGN(sizeof(BufferStrategyFreelist) * num_partitions),
+ MAXALIGN(sizeof(BufferStrategyFreelist) * num_partitions) +
+ MAXALIGN(sizeof(ClockSweep) * num_partitions),
&found);
if (!found)
@@ -718,12 +820,41 @@ StrategyInitialize(bool init)
SpinLockInit(&StrategyControl->buffer_strategy_lock);
- /* Initialize the clock sweep pointer */
- pg_atomic_init_u32(&StrategyControl->nextVictimBuffer, 0);
+ /* have to point the sweeps array to right after the freelists */
+ ptr = (char *) StrategyControl +
+ MAXALIGN(offsetof(BufferStrategyControl, freelists)) +
+ MAXALIGN(sizeof(BufferStrategyFreelist) * num_partitions);
+ StrategyControl->sweeps = (ClockSweep *) ptr;
+
+ /* Initialize the clock sweep pointers (for all partitions) */
+ for (int i = 0; i < num_partitions; i++)
+ {
+ int node,
+ num_buffers,
+ first_buffer,
+ last_buffer;
+
+ SpinLockInit(&StrategyControl->sweeps[i].clock_sweep_lock);
+
+ pg_atomic_init_u32(&StrategyControl->sweeps[i].nextVictimBuffer, 0);
- /* Clear statistics */
- StrategyControl->completePasses = 0;
- pg_atomic_init_u32(&StrategyControl->numBufferAllocs, 0);
+ /* get info about the buffer partition */
+ BufferPartitionGet(i, &node, &num_buffers,
+ &first_buffer, &last_buffer);
+
+ /*
+		 * FIXME This may not be quite right, because if NBuffers is not a
+ * perfect multiple of numBuffers, the last partition will have
+ * numBuffers set too high. buf_init handles this by tracking the
+ * remaining number of buffers, and not overflowing.
+ */
+ StrategyControl->sweeps[i].numBuffers = num_buffers;
+ StrategyControl->sweeps[i].firstBuffer = first_buffer;
+
+ /* Clear statistics */
+ StrategyControl->sweeps[i].completePasses = 0;
+ pg_atomic_init_u32(&StrategyControl->sweeps[i].numBufferAllocs, 0);
+ }
/* No pending notification */
StrategyControl->bgwprocno = -1;
@@ -771,7 +902,6 @@ StrategyInitialize(bool init)
buf->freeNext = freelist->firstFreeBuffer;
freelist->firstFreeBuffer = i;
}
-
}
}
else
@@ -1111,9 +1241,11 @@ StrategyRejectBuffer(BufferAccessStrategy strategy, BufferDesc *buf, bool from_r
}
void
-FreelistPartitionGetInfo(int idx, uint64 *consumed, uint64 *remain, uint64 *actually_free)
+FreelistPartitionGetInfo(int idx, uint64 *consumed, uint64 *remain, uint64 *actually_free,
+ uint32 *complete_passes, uint32 *buffer_allocs, uint32 *next_victim_buffer)
{
BufferStrategyFreelist *freelist;
+ ClockSweep *sweep;
int cur;
/* stats */
@@ -1123,6 +1255,7 @@ FreelistPartitionGetInfo(int idx, uint64 *consumed, uint64 *remain, uint64 *actu
Assert((idx >= 0) && (idx < StrategyControl->num_partitions));
freelist = &StrategyControl->freelists[idx];
+ sweep = &StrategyControl->sweeps[idx];
/* stat */
SpinLockAcquire(&freelist->freelist_lock);
@@ -1152,4 +1285,11 @@ FreelistPartitionGetInfo(int idx, uint64 *consumed, uint64 *remain, uint64 *actu
*remain = cnt_remain;
*actually_free = cnt_free;
+
+ /* get the clocksweep stats too */
+ *complete_passes = sweep->completePasses;
+ *buffer_allocs = pg_atomic_read_u32(&sweep->numBufferAllocs);
+ *next_victim_buffer = pg_atomic_read_u32(&sweep->nextVictimBuffer);
+
+ *next_victim_buffer = sweep->firstBuffer + (*next_victim_buffer % sweep->numBuffers);
}
diff --git a/src/include/storage/buf_internals.h b/src/include/storage/buf_internals.h
index 9dfbecb9fe4..907b160b4f7 100644
--- a/src/include/storage/buf_internals.h
+++ b/src/include/storage/buf_internals.h
@@ -449,7 +449,9 @@ extern void StrategyFreeBuffer(BufferDesc *buf);
extern bool StrategyRejectBuffer(BufferAccessStrategy strategy,
BufferDesc *buf, bool from_ring);
-extern int StrategySyncStart(uint32 *complete_passes, uint32 *num_buf_alloc);
+extern void StrategySyncPrepare(int *num_parts, uint32 *num_buf_alloc);
+extern int StrategySyncStart(int partition, uint32 *complete_passes,
+ int *first_buffer, int *num_buffers);
extern void StrategyNotifyBgWriter(int bgwprocno);
extern Size StrategyShmemSize(void);
diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index df127274190..53855d4be23 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -349,7 +349,10 @@ extern int GetAccessStrategyPinLimit(BufferAccessStrategy strategy);
extern void FreeAccessStrategy(BufferAccessStrategy strategy);
extern void FreelistPartitionGetInfo(int idx,
uint64 *consumed, uint64 *remain,
- uint64 *actually_free);
+ uint64 *actually_free,
+ uint32 *complete_passes,
+ uint32 *buffer_allocs,
+ uint32 *next_victim_buffer);
/* inline functions */
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index c695cfa76e8..a38dd8d6242 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -428,6 +428,7 @@ ClientCertName
ClientConnectionInfo
ClientData
ClientSocket
+ClockSweep
ClonePtrType
ClosePortalStmt
ClosePtrType
--
2.50.1
Attachment: v3-0004-NUMA-partition-buffer-freelist.patch (text/x-patch)
From 06a43d54498ab5049c12a458cfbd4fe3b3b168c2 Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tomas@vondra.me>
Date: Thu, 22 May 2025 18:38:41 +0200
Subject: [PATCH v3 4/7] NUMA: partition buffer freelist
Instead of a single buffer freelist, partition it into multiple smaller
lists, to reduce lock contention, and to spread the buffers over all
NUMA nodes more evenly.
This uses the buffer partitioning scheme introduced by the earlier
patch, i.e. the partitions will "align" with NUMA nodes, etc.
It also extends the "pg_buffercache_partitions" view, to include
information about each freelist (number of consumed buffers, ...).
When allocating a buffer, it's taken from the correct freelist (same
NUMA node).
Note: This is (probably) more important than partitioning ProcArray.
---
.../pg_buffercache--1.6--1.7.sql | 2 +-
contrib/pg_buffercache/pg_buffercache_pages.c | 24 +-
src/backend/storage/buffer/freelist.c | 360 ++++++++++++++++--
src/backend/utils/init/globals.c | 1 +
src/backend/utils/misc/guc_tables.c | 10 +
src/include/miscadmin.h | 1 +
src/include/storage/bufmgr.h | 4 +-
7 files changed, 372 insertions(+), 30 deletions(-)
diff --git a/contrib/pg_buffercache/pg_buffercache--1.6--1.7.sql b/contrib/pg_buffercache/pg_buffercache--1.6--1.7.sql
index bd97246f6ab..3871c261528 100644
--- a/contrib/pg_buffercache/pg_buffercache--1.6--1.7.sql
+++ b/contrib/pg_buffercache/pg_buffercache--1.6--1.7.sql
@@ -12,7 +12,7 @@ LANGUAGE C PARALLEL SAFE;
-- Create a view for convenient access.
CREATE VIEW pg_buffercache_partitions AS
SELECT P.* FROM pg_buffercache_partitions() AS P
- (partition integer, numa_node integer, num_buffers integer, first_buffer integer, last_buffer integer);
+ (partition integer, numa_node integer, num_buffers integer, first_buffer integer, last_buffer integer, buffers_consumed bigint, buffers_remain bigint, buffers_free bigint);
-- Don't want these to be available to public.
REVOKE ALL ON FUNCTION pg_buffercache_partitions() FROM PUBLIC;
diff --git a/contrib/pg_buffercache/pg_buffercache_pages.c b/contrib/pg_buffercache/pg_buffercache_pages.c
index 8baa7c7b543..668ada8c47b 100644
--- a/contrib/pg_buffercache/pg_buffercache_pages.c
+++ b/contrib/pg_buffercache/pg_buffercache_pages.c
@@ -27,7 +27,7 @@
#define NUM_BUFFERCACHE_EVICT_ALL_ELEM 3
#define NUM_BUFFERCACHE_NUMA_ELEM 3
-#define NUM_BUFFERCACHE_PARTITIONS_ELEM 5
+#define NUM_BUFFERCACHE_PARTITIONS_ELEM 8
PG_MODULE_MAGIC_EXT(
.name = "pg_buffercache",
@@ -812,6 +812,12 @@ pg_buffercache_partitions(PG_FUNCTION_ARGS)
INT4OID, -1, 0);
TupleDescInitEntry(tupledesc, (AttrNumber) 5, "last_buffer",
INT4OID, -1, 0);
+ TupleDescInitEntry(tupledesc, (AttrNumber) 6, "buffers_consumed",
+ INT8OID, -1, 0);
+ TupleDescInitEntry(tupledesc, (AttrNumber) 7, "buffers_remain",
+ INT8OID, -1, 0);
+ TupleDescInitEntry(tupledesc, (AttrNumber) 8, "buffers_free",
+ INT8OID, -1, 0);
funcctx->user_fctx = BlessTupleDesc(tupledesc);
@@ -833,12 +839,19 @@ pg_buffercache_partitions(PG_FUNCTION_ARGS)
first_buffer,
last_buffer;
+ uint64 buffers_consumed,
+ buffers_remain,
+ buffers_free;
+
Datum values[NUM_BUFFERCACHE_PARTITIONS_ELEM];
bool nulls[NUM_BUFFERCACHE_PARTITIONS_ELEM];
BufferPartitionGet(i, &numa_node, &num_buffers,
&first_buffer, &last_buffer);
+ FreelistPartitionGetInfo(i, &buffers_consumed, &buffers_remain,
+ &buffers_free);
+
values[0] = Int32GetDatum(i);
nulls[0] = false;
@@ -854,6 +867,15 @@ pg_buffercache_partitions(PG_FUNCTION_ARGS)
values[4] = Int32GetDatum(last_buffer);
nulls[4] = false;
+ values[5] = Int64GetDatum(buffers_consumed);
+ nulls[5] = false;
+
+ values[6] = Int64GetDatum(buffers_remain);
+ nulls[6] = false;
+
+ values[7] = Int64GetDatum(buffers_free);
+ nulls[7] = false;
+
/* Build and return the tuple. */
tuple = heap_form_tuple((TupleDesc) funcctx->user_fctx, values, nulls);
result = HeapTupleGetDatum(tuple);
diff --git a/src/backend/storage/buffer/freelist.c b/src/backend/storage/buffer/freelist.c
index e046526c149..c3fbd651dd5 100644
--- a/src/backend/storage/buffer/freelist.c
+++ b/src/backend/storage/buffer/freelist.c
@@ -15,14 +15,52 @@
*/
#include "postgres.h"
+#include <sched.h>
+#include <sys/sysinfo.h>
+
+#ifdef USE_LIBNUMA
+#include <numa.h>
+#include <numaif.h>
+#endif
+
#include "pgstat.h"
#include "port/atomics.h"
#include "storage/buf_internals.h"
#include "storage/bufmgr.h"
+#include "storage/ipc.h"
#include "storage/proc.h"
#define INT_ACCESS_ONCE(var) ((int)(*((volatile int *)&(var))))
+/*
+ * Represents one freelist partition.
+ */
+typedef struct BufferStrategyFreelist
+{
+ /* Spinlock: protects the values below */
+ slock_t freelist_lock;
+
+ /*
+ * XXX Not sure why this needs to be aligned like this. Need to ask
+ * Andres.
+ */
+ int firstFreeBuffer __attribute__((aligned(64))); /* Head of list of
+ * unused buffers */
+
+ /* Number of buffers consumed from this list. */
+ uint64 consumed;
+} BufferStrategyFreelist;
+
+/*
+ * The minimum number of partitions we want to have, even on a non-NUMA
+ * system, as it helps with contention for buffers. With multiple NUMA nodes
+ * we want a separate partition per node, and with a low node count we may
+ * get multiple partitions per node.
+ *
+ * With multiple partitions per NUMA node, we pick the partition based on CPU
+ * (or some other parameter).
+ */
+#define MIN_FREELIST_PARTITIONS 4
/*
* The shared freelist control information.
@@ -39,8 +77,6 @@ typedef struct
*/
pg_atomic_uint32 nextVictimBuffer;
- int firstFreeBuffer; /* Head of list of unused buffers */
-
/*
* Statistics. These counters should be wide enough that they can't
* overflow during a single bgwriter cycle.
@@ -51,8 +87,19 @@ typedef struct
/*
* Bgworker process to be notified upon activity or -1 if none. See
* StrategyNotifyBgWriter.
+ *
+ * XXX Not sure why this needs to be aligned like this. Need to ask
+ * Andres. Also, shouldn't the alignment be specified after, like for
+ * "consumed"?
*/
- int bgwprocno;
+ int __attribute__((aligned(64))) bgwprocno;
+
+ /* info about freelist partitioning */
+ int num_nodes; /* effectively number of NUMA nodes */
+ int num_partitions;
+ int num_partitions_per_node;
+
+ BufferStrategyFreelist freelists[FLEXIBLE_ARRAY_MEMBER];
} BufferStrategyControl;
/* Pointers to shared state */
@@ -157,6 +204,88 @@ ClockSweepTick(void)
return victim;
}
+static int
+calculate_partition_index()
+{
+ int rc;
+ unsigned cpu;
+ unsigned node;
+ int index;
+
+ Assert(StrategyControl->num_partitions ==
+ (StrategyControl->num_nodes * StrategyControl->num_partitions_per_node));
+
+ /*
+ * freelist is partitioned, so determine the CPU/NUMA node, and pick a
+ * list based on that.
+ */
+ rc = getcpu(&cpu, &node);
+ if (rc != 0)
+ elog(ERROR, "getcpu failed: %m");
+
+ /*
+	 * XXX We shouldn't get nodes that we haven't considered while building the
+ * partitions. Maybe if we allow this (e.g. due to support adjusting the
+ * NUMA stuff at runtime), we should just do our best to minimize the
+ * conflicts somehow. But it'll make the mapping harder, so for now we
+ * ignore it.
+ */
+ if (node > StrategyControl->num_nodes)
+		elog(ERROR, "node out of range: %u > %d", node, StrategyControl->num_nodes);
+
+ /*
+ * Find the partition. If we have a single partition per node, we can
+ * calculate the index directly from node. Otherwise we need to do two
+ * steps, using node and then cpu.
+ */
+ if (StrategyControl->num_partitions_per_node == 1)
+ {
+ index = (node % StrategyControl->num_partitions);
+ }
+ else
+ {
+ int index_group,
+ index_part;
+
+ /* two steps - calculate group from node, partition from cpu */
+ index_group = (node % StrategyControl->num_nodes);
+ index_part = (cpu % StrategyControl->num_partitions_per_node);
+
+ index = (index_group * StrategyControl->num_partitions_per_node)
+ + index_part;
+ }
+
+ return index;
+}
+
+/*
+ * ChooseFreeList
+ * Pick the buffer freelist to use, depending on the CPU and NUMA node.
+ *
+ * Without partitioned freelists (numa_partition_freelist=false), there's only
+ * a single freelist, so use that.
+ *
+ * With partitioned freelists, we have multiple ways to pick the freelist
+ * for the backend:
+ *
+ * - one freelist per CPU, use the freelist for the CPU the task executes on
+ *
+ * - one freelist per NUMA node, use the freelist for the node the task
+ *   executes on
+ *
+ * - use a fixed number of freelists, map processes to lists based on PID
+ *
+ * There may be some other strategies, not sure. The important thing is this
+ * needs to be reflected during initialization, i.e. we need to create the
+ * right number of lists.
+ */
+static BufferStrategyFreelist *
+ChooseFreeList(void)
+{
+ int index = calculate_partition_index();
+
+ return &StrategyControl->freelists[index];
+}
+
/*
* have_free_buffer -- a lockless check to see if there is a free buffer in
* buffer pool.
@@ -168,10 +297,13 @@ ClockSweepTick(void)
bool
have_free_buffer(void)
{
- if (StrategyControl->firstFreeBuffer >= 0)
- return true;
- else
- return false;
+ for (int i = 0; i < StrategyControl->num_partitions; i++)
+ {
+ if (StrategyControl->freelists[i].firstFreeBuffer >= 0)
+ return true;
+ }
+
+ return false;
}
/*
@@ -193,6 +325,7 @@ StrategyGetBuffer(BufferAccessStrategy strategy, uint32 *buf_state, bool *from_r
int bgwprocno;
int trycounter;
uint32 local_buf_state; /* to avoid repeated (de-)referencing */
+ BufferStrategyFreelist *freelist;
*from_ring = false;
@@ -259,31 +392,35 @@ StrategyGetBuffer(BufferAccessStrategy strategy, uint32 *buf_state, bool *from_r
* buffer_strategy_lock not the individual buffer spinlocks, so it's OK to
* manipulate them without holding the spinlock.
*/
- if (StrategyControl->firstFreeBuffer >= 0)
+ freelist = ChooseFreeList();
+ if (freelist->firstFreeBuffer >= 0)
{
while (true)
{
/* Acquire the spinlock to remove element from the freelist */
- SpinLockAcquire(&StrategyControl->buffer_strategy_lock);
+ SpinLockAcquire(&freelist->freelist_lock);
- if (StrategyControl->firstFreeBuffer < 0)
+ if (freelist->firstFreeBuffer < 0)
{
- SpinLockRelease(&StrategyControl->buffer_strategy_lock);
+ SpinLockRelease(&freelist->freelist_lock);
break;
}
- buf = GetBufferDescriptor(StrategyControl->firstFreeBuffer);
+ buf = GetBufferDescriptor(freelist->firstFreeBuffer);
Assert(buf->freeNext != FREENEXT_NOT_IN_LIST);
/* Unconditionally remove buffer from freelist */
- StrategyControl->firstFreeBuffer = buf->freeNext;
+ freelist->firstFreeBuffer = buf->freeNext;
buf->freeNext = FREENEXT_NOT_IN_LIST;
+ /* increment number of buffers we consumed from this list */
+ freelist->consumed++;
+
/*
* Release the lock so someone else can access the freelist while
* we check out this buffer.
*/
- SpinLockRelease(&StrategyControl->buffer_strategy_lock);
+ SpinLockRelease(&freelist->freelist_lock);
/*
* If the buffer is pinned or has a nonzero usage_count, we cannot
@@ -305,7 +442,17 @@ StrategyGetBuffer(BufferAccessStrategy strategy, uint32 *buf_state, bool *from_r
}
}
- /* Nothing on the freelist, so run the "clock sweep" algorithm */
+ /*
+ * Nothing on the freelist, so run the "clock sweep" algorithm
+ *
+ * XXX Should we also make this NUMA-aware, to only access buffers from
+ * the same NUMA node? That'd probably mean we need to make the clock
+ * sweep NUMA-aware, perhaps by having multiple clock sweeps, each for a
+ * subset of buffers. But that also means each process could "sweep" only
+ * a fraction of buffers, even if the other buffers are better candidates
+ * for eviction. Would that also mean we'd have multiple bgwriters, one
+ * for each node, or would one bgwriter handle all of that?
+ */
trycounter = NBuffers;
for (;;)
{
@@ -356,7 +503,22 @@ StrategyGetBuffer(BufferAccessStrategy strategy, uint32 *buf_state, bool *from_r
void
StrategyFreeBuffer(BufferDesc *buf)
{
- SpinLockAcquire(&StrategyControl->buffer_strategy_lock);
+ BufferStrategyFreelist *freelist;
+
+ /*
+ * We don't want to call ChooseFreeList() again, because we might get a
+ * completely different freelist - either a different partition in the
+ * same group, or even a different group if the NUMA node changed. But we
+ * can calculate the proper freelist from the buffer id.
+ */
+ int index = (BufferGetNode(buf->buf_id) * StrategyControl->num_partitions_per_node)
+ + (buf->buf_id % StrategyControl->num_partitions_per_node);
+
+ Assert((index >= 0) && (index < StrategyControl->num_partitions));
+
+ freelist = &StrategyControl->freelists[index];
+
+ SpinLockAcquire(&freelist->freelist_lock);
/*
* It is possible that we are told to put something in the freelist that
@@ -364,11 +526,11 @@ StrategyFreeBuffer(BufferDesc *buf)
*/
if (buf->freeNext == FREENEXT_NOT_IN_LIST)
{
- buf->freeNext = StrategyControl->firstFreeBuffer;
- StrategyControl->firstFreeBuffer = buf->buf_id;
+ buf->freeNext = freelist->firstFreeBuffer;
+ freelist->firstFreeBuffer = buf->buf_id;
}
- SpinLockRelease(&StrategyControl->buffer_strategy_lock);
+ SpinLockRelease(&freelist->freelist_lock);
}
/*
@@ -432,6 +594,42 @@ StrategyNotifyBgWriter(int bgwprocno)
SpinLockRelease(&StrategyControl->buffer_strategy_lock);
}
+/* prints some debug info / stats about freelists at shutdown */
+static void
+freelist_before_shmem_exit(int code, Datum arg)
+{
+ for (int p = 0; p < StrategyControl->num_partitions; p++)
+ {
+ BufferStrategyFreelist *freelist = &StrategyControl->freelists[p];
+ uint64 remain = 0;
+ uint64 actually_free = 0;
+ int cur = freelist->firstFreeBuffer;
+
+ while (cur >= 0)
+ {
+ uint32 local_buf_state;
+ BufferDesc *buf;
+
+ buf = GetBufferDescriptor(cur);
+
+ remain++;
+
+ local_buf_state = LockBufHdr(buf);
+
+ if (!(local_buf_state & BM_TAG_VALID))
+ actually_free++;
+
+ UnlockBufHdr(buf, local_buf_state);
+
+ cur = buf->freeNext;
+ }
+ elog(LOG, "NUMA: freelist partition %d, firstF: %d: consumed: %lu, remain: %lu, actually free: %lu",
+ p,
+ freelist->firstFreeBuffer,
+ freelist->consumed,
+ remain, actually_free);
+ }
+}
/*
* StrategyShmemSize
@@ -445,12 +643,20 @@ Size
StrategyShmemSize(void)
{
Size size = 0;
+ int num_partitions;
+ int num_nodes;
+
+ BufferPartitionParams(&num_partitions, &num_nodes);
/* size of lookup hash table ... see comment in StrategyInitialize */
size = add_size(size, BufTableShmemSize(NBuffers + NUM_BUFFER_PARTITIONS));
/* size of the shared replacement strategy control block */
- size = add_size(size, MAXALIGN(sizeof(BufferStrategyControl)));
+ size = add_size(size, MAXALIGN(offsetof(BufferStrategyControl, freelists)));
+
+ /* size of freelist partitions (at least one per NUMA node) */
+ size = add_size(size, MAXALIGN(mul_size(sizeof(BufferStrategyFreelist),
+ num_partitions)));
return size;
}
@@ -467,6 +673,18 @@ StrategyInitialize(bool init)
{
bool found;
+ int num_nodes;
+ int num_partitions;
+ int num_partitions_per_node;
+
+ /* */
+ BufferPartitionParams(&num_partitions, &num_nodes);
+
+ /* always a multiple of NUMA nodes */
+ Assert(num_partitions % num_nodes == 0);
+
+ num_partitions_per_node = (num_partitions / num_nodes);
+
/*
* Initialize the shared buffer lookup hashtable.
*
@@ -484,7 +702,8 @@ StrategyInitialize(bool init)
*/
StrategyControl = (BufferStrategyControl *)
ShmemInitStruct("Buffer Strategy Status",
- sizeof(BufferStrategyControl),
+ MAXALIGN(offsetof(BufferStrategyControl, freelists)) +
+ MAXALIGN(sizeof(BufferStrategyFreelist) * num_partitions),
&found);
if (!found)
@@ -494,13 +713,10 @@ StrategyInitialize(bool init)
*/
Assert(init);
- SpinLockInit(&StrategyControl->buffer_strategy_lock);
+ /* register callback to dump some stats on exit */
+ before_shmem_exit(freelist_before_shmem_exit, 0);
- /*
- * Grab the whole linked list of free buffers for our strategy. We
- * assume it was previously set up by BufferManagerShmemInit().
- */
- StrategyControl->firstFreeBuffer = 0;
+ SpinLockInit(&StrategyControl->buffer_strategy_lock);
/* Initialize the clock sweep pointer */
pg_atomic_init_u32(&StrategyControl->nextVictimBuffer, 0);
@@ -511,6 +727,52 @@ StrategyInitialize(bool init)
/* No pending notification */
StrategyControl->bgwprocno = -1;
+
+ /* initialize the partitioned clocksweep */
+ StrategyControl->num_partitions = num_partitions;
+ StrategyControl->num_nodes = num_nodes;
+ StrategyControl->num_partitions_per_node = num_partitions_per_node;
+
+ /*
+ * Rebuild the freelist - right now all buffers are in one huge list,
+ * we want to rework that into multiple lists. Start by initializing
+ * the strategy to have empty lists.
+ */
+ for (int nfreelist = 0; nfreelist < num_partitions; nfreelist++)
+ {
+ int node,
+ num_buffers,
+ first_buffer,
+ last_buffer;
+
+ BufferStrategyFreelist *freelist;
+
+ freelist = &StrategyControl->freelists[nfreelist];
+
+ freelist->firstFreeBuffer = FREENEXT_END_OF_LIST;
+
+ SpinLockInit(&freelist->freelist_lock);
+
+ /* get info about the buffer partition */
+ BufferPartitionGet(nfreelist, &node,
+ &num_buffers, &first_buffer, &last_buffer);
+
+ /*
+ * Walk through buffers for each partition, add them to the list.
+ * Walk from the end, because we're adding the buffers to the
+ * beginning.
+ */
+
+ for (int i = last_buffer; i >= first_buffer; i--)
+ {
+ BufferDesc *buf = GetBufferDescriptor(i);
+
+ /* add to the freelist */
+ buf->freeNext = freelist->firstFreeBuffer;
+ freelist->firstFreeBuffer = i;
+ }
+
+ }
}
else
Assert(!init);
@@ -847,3 +1109,47 @@ StrategyRejectBuffer(BufferAccessStrategy strategy, BufferDesc *buf, bool from_r
return true;
}
+
+void
+FreelistPartitionGetInfo(int idx, uint64 *consumed, uint64 *remain, uint64 *actually_free)
+{
+ BufferStrategyFreelist *freelist;
+ int cur;
+
+ /* stats */
+ uint64 cnt_remain = 0;
+ uint64 cnt_free = 0;
+
+ Assert((idx >= 0) && (idx < StrategyControl->num_partitions));
+
+ freelist = &StrategyControl->freelists[idx];
+
+ /* stat */
+ SpinLockAcquire(&freelist->freelist_lock);
+
+ *consumed = freelist->consumed;
+
+ cur = freelist->firstFreeBuffer;
+ while (cur >= 0)
+ {
+ uint32 local_buf_state;
+ BufferDesc *buf;
+
+ buf = GetBufferDescriptor(cur);
+
+ cnt_remain++;
+
+ local_buf_state = LockBufHdr(buf);
+
+ if (!(local_buf_state & BM_TAG_VALID))
+ cnt_free++;
+
+ UnlockBufHdr(buf, local_buf_state);
+
+ cur = buf->freeNext;
+ }
+ SpinLockRelease(&freelist->freelist_lock);
+
+ *remain = cnt_remain;
+ *actually_free = cnt_free;
+}
diff --git a/src/backend/utils/init/globals.c b/src/backend/utils/init/globals.c
index f5359db3656..a11bc71a386 100644
--- a/src/backend/utils/init/globals.c
+++ b/src/backend/utils/init/globals.c
@@ -148,6 +148,7 @@ int MaxBackends = 0;
/* NUMA stuff */
bool numa_buffers_interleave = false;
bool numa_localalloc = false;
+bool numa_partition_freelist = false;
/* GUC parameters for vacuum */
int VacuumBufferUsageLimit = 2048;
diff --git a/src/backend/utils/misc/guc_tables.c b/src/backend/utils/misc/guc_tables.c
index a21f20800fb..0552ed62cc7 100644
--- a/src/backend/utils/misc/guc_tables.c
+++ b/src/backend/utils/misc/guc_tables.c
@@ -2136,6 +2136,16 @@ struct config_bool ConfigureNamesBool[] =
NULL, NULL, NULL
},
+ {
+ {"numa_partition_freelist", PGC_POSTMASTER, DEVELOPER_OPTIONS,
+ gettext_noop("Enables buffer freelists to be partitioned per NUMA node."),
+ gettext_noop("When enabled, we create a separate freelist per NUMA node."),
+ },
+ &numa_partition_freelist,
+ false,
+ NULL, NULL, NULL
+ },
+
{
{"sync_replication_slots", PGC_SIGHUP, REPLICATION_STANDBY,
gettext_noop("Enables a physical standby to synchronize logical failover replication slots from the primary server."),
diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h
index 692871a401f..66baf2bf33e 100644
--- a/src/include/miscadmin.h
+++ b/src/include/miscadmin.h
@@ -180,6 +180,7 @@ extern PGDLLIMPORT int max_parallel_workers;
extern PGDLLIMPORT bool numa_buffers_interleave;
extern PGDLLIMPORT bool numa_localalloc;
+extern PGDLLIMPORT bool numa_partition_freelist;
extern PGDLLIMPORT int commit_timestamp_buffers;
extern PGDLLIMPORT int multixact_member_buffers;
diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index deaf4f19fa4..df127274190 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -347,7 +347,9 @@ extern int GetAccessStrategyBufferCount(BufferAccessStrategy strategy);
extern int GetAccessStrategyPinLimit(BufferAccessStrategy strategy);
extern void FreeAccessStrategy(BufferAccessStrategy strategy);
-
+extern void FreelistPartitionGetInfo(int idx,
+ uint64 *consumed, uint64 *remain,
+ uint64 *actually_free);
/* inline functions */
--
2.50.1
Attachment: v3-0003-freelist-Don-t-track-tail-of-a-freelist.patch (text/x-patch)
From 3c7dbbec4ee3957c92bf647605768beb5473f66b Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Mon, 14 Oct 2024 14:10:13 -0400
Subject: [PATCH v3 3/7] freelist: Don't track tail of a freelist
The freelist tail isn't currently used, making it unnecessary overhead.
So just don't do that.
---
src/backend/storage/buffer/freelist.c | 9 ---------
1 file changed, 9 deletions(-)
diff --git a/src/backend/storage/buffer/freelist.c b/src/backend/storage/buffer/freelist.c
index 01909be0272..e046526c149 100644
--- a/src/backend/storage/buffer/freelist.c
+++ b/src/backend/storage/buffer/freelist.c
@@ -40,12 +40,6 @@ typedef struct
pg_atomic_uint32 nextVictimBuffer;
int firstFreeBuffer; /* Head of list of unused buffers */
- int lastFreeBuffer; /* Tail of list of unused buffers */
-
- /*
- * NOTE: lastFreeBuffer is undefined when firstFreeBuffer is -1 (that is,
- * when the list is empty)
- */
/*
* Statistics. These counters should be wide enough that they can't
@@ -371,8 +365,6 @@ StrategyFreeBuffer(BufferDesc *buf)
if (buf->freeNext == FREENEXT_NOT_IN_LIST)
{
buf->freeNext = StrategyControl->firstFreeBuffer;
- if (buf->freeNext < 0)
- StrategyControl->lastFreeBuffer = buf->buf_id;
StrategyControl->firstFreeBuffer = buf->buf_id;
}
@@ -509,7 +501,6 @@ StrategyInitialize(bool init)
* assume it was previously set up by BufferManagerShmemInit().
*/
StrategyControl->firstFreeBuffer = 0;
- StrategyControl->lastFreeBuffer = NBuffers - 1;
/* Initialize the clock sweep pointer */
pg_atomic_init_u32(&StrategyControl->nextVictimBuffer, 0);
--
2.50.1
Attachment: v3-0002-NUMA-localalloc.patch (text/x-patch)
From d08a94b0f54a4c4986d351f98269905bb511624c Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tomas@vondra.me>
Date: Thu, 22 May 2025 18:27:06 +0200
Subject: [PATCH v3 2/7] NUMA: localalloc
Set the default allocation policy to "localalloc", which means from the
local NUMA node. This is useful for process-private memory, which is not
going to be shared with other nodes, and is relatively short-lived (so
we're unlikely to have issues if the process gets moved by the scheduler).
This sets the default for the whole process, for all future allocations. But
that's fine, we've already populated the shared memory earlier (by
interleaving it explicitly). Otherwise we'd trigger a page fault and it'd
be allocated on the local node.
XXX This patch may not be necessary, as we now locate memory to nodes
using explicit numa_tonode_memory() calls, and not by interleaving. But
it's useful for experiments during development, so I'm keeping it.
---
src/backend/utils/init/globals.c | 1 +
src/backend/utils/init/miscinit.c | 17 +++++++++++++++++
src/backend/utils/misc/guc_tables.c | 10 ++++++++++
src/include/miscadmin.h | 1 +
4 files changed, 29 insertions(+)
diff --git a/src/backend/utils/init/globals.c b/src/backend/utils/init/globals.c
index 876cb64cf66..f5359db3656 100644
--- a/src/backend/utils/init/globals.c
+++ b/src/backend/utils/init/globals.c
@@ -147,6 +147,7 @@ int MaxBackends = 0;
/* NUMA stuff */
bool numa_buffers_interleave = false;
+bool numa_localalloc = false;
/* GUC parameters for vacuum */
int VacuumBufferUsageLimit = 2048;
diff --git a/src/backend/utils/init/miscinit.c b/src/backend/utils/init/miscinit.c
index 43b4dbccc3d..079974944e9 100644
--- a/src/backend/utils/init/miscinit.c
+++ b/src/backend/utils/init/miscinit.c
@@ -28,6 +28,10 @@
#include <arpa/inet.h>
#include <utime.h>
+#ifdef USE_LIBNUMA
+#include <numa.h>
+#endif
+
#include "access/htup_details.h"
#include "access/parallel.h"
#include "catalog/pg_authid.h"
@@ -164,6 +168,19 @@ InitPostmasterChild(void)
(errcode_for_socket_access(),
errmsg_internal("could not set postmaster death monitoring pipe to FD_CLOEXEC mode: %m")));
#endif
+
+#ifdef USE_LIBNUMA
+
+ /*
+ * Set the default allocation policy to local node, where the task is
+ * executing at the time of a page fault.
+ *
+ * XXX I believe this is not necessary, now that we don't use automatic
+ * interleaving (numa_set_interleave_mask).
+ */
+ if (numa_localalloc)
+ numa_set_localalloc();
+#endif
}
/*
diff --git a/src/backend/utils/misc/guc_tables.c b/src/backend/utils/misc/guc_tables.c
index 9570087aa60..a21f20800fb 100644
--- a/src/backend/utils/misc/guc_tables.c
+++ b/src/backend/utils/misc/guc_tables.c
@@ -2126,6 +2126,16 @@ struct config_bool ConfigureNamesBool[] =
NULL, NULL, NULL
},
+ {
+ {"numa_localalloc", PGC_POSTMASTER, DEVELOPER_OPTIONS,
+ gettext_noop("Enables setting the default allocation policy to local node."),
+ gettext_noop("When enabled, allocate from the node where the task is executing."),
+ },
+ &numa_localalloc,
+ false,
+ NULL, NULL, NULL
+ },
+
{
{"sync_replication_slots", PGC_SIGHUP, REPLICATION_STANDBY,
gettext_noop("Enables a physical standby to synchronize logical failover replication slots from the primary server."),
diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h
index 014a6079af2..692871a401f 100644
--- a/src/include/miscadmin.h
+++ b/src/include/miscadmin.h
@@ -179,6 +179,7 @@ extern PGDLLIMPORT int max_worker_processes;
extern PGDLLIMPORT int max_parallel_workers;
extern PGDLLIMPORT bool numa_buffers_interleave;
+extern PGDLLIMPORT bool numa_localalloc;
extern PGDLLIMPORT int commit_timestamp_buffers;
extern PGDLLIMPORT int multixact_member_buffers;
--
2.50.1
On Mon, Jul 28, 2025 at 4:22 PM Tomas Vondra <tomas@vondra.me> wrote:
Hi Tomas,
just a quick look here:
2) The PGPROC part introduces a similar registry, [..]
There's also a view pg_buffercache_pgproc. The pg_buffercache location
is a bit bogus - it has nothing to do with buffers, but it was good
enough for now.
If you are looking for better names: pg_shmem_pgproc_numa would sound
like a more natural name.
3) The PGPROC partitioning is reworked and should fix the crash with the
GUC set to "off".
Thanks!
simple benchmark
----------------
[..]
There's results for the three "pgbench pinning" strategies, and that can
have pretty significant impact (colocated generally performs much better
than either "none" or "random").
Hint: real world is that network cards are usually located on some PCI
slot that is assigned to certain node (so traffic is flowing from/to
there), so probably it would make some sense to put pgbench outside
this machine and remove this as "variable" anyway and remove the need
for that pgbench --pin-cpus in script. In optimal conditions: most
optimized layout would be probably to have 2 cards on separate PCI
slots, each for different node and some LACP between those, with
xmit_hash_policy allowing traffic distribution on both of those cards
-- usually there's not just single IP/MAC out there talking to/from
such server, so that would be real-world (or lack of) affinity.
Also, the classic pgbench workload seems to be a poor fit for testing it
out (at least v3-0001 buffers); there I would propose sticking to just
lots of big (~s_b size) full table seq scans to put stress on shared
memory. Classic pgbench is usually not enough to put serious bandwidth
on the interconnect, by my measurements.
For the "bigger" machine (wiuth 176 cores) the incremental results look
like this (for pinning=none, i.e. regular pgbench):mode s_b buffers localal no-tail freelist sweep pgproc pinning
====================================================================
prepared 16GB 99% 101% 100% 103% 111% 99% 102%
32GB 98% 102% 99% 103% 107% 101% 112%
8GB 97% 102% 100% 102% 101% 101% 106%
--------------------------------------------------------------------
simple 16GB 100% 100% 99% 105% 108% 99% 108%
32GB 98% 101% 100% 103% 100% 101% 97%
8GB 100% 100% 101% 99% 100% 104% 104%The way I read this is that the first three patches have about no impact
on throughput. Then freelist partitioning and (especially) clocksweep
partitioning can help quite a bit. pgproc is again close to ~0%, and
PGPROC pinning can help again (but this part is merely experimental).
Isn't the "pinning" column representing just numa_procs_pin=on ?
(shouldn't it be tested with numa_procs_interleave = on?)
[..]
To quantify this kind of improvement, I think we'll need tests that
intentionally cause (or try to) imbalance. If you have ideas for such
tests, let me know.
Some ideas:
1. concurrent seq scans hitting s_b-sized table
2. one single giant PX-enabled seq scan with $VCPU workers (stresses
the importance of interleaving dynamic shm for workers)
3. select txid_current() with -M prepared?
reserving number of huge pages
------------------------------
[..]
It took me ages to realize what's happening, but it's very simple. The
nr_hugepages is a global limit, but it's also translated into limits for
each NUMA node. So when you write 16828 to it, in a 4-node system each
node gets 1/4 of that. See

$ numastat -cm
Then we do the mmap(), and everything looks great, because there really
is enough huge pages and the system can allocate memory from any NUMA
node it needs.
Yup, similar story as with OOMs, just per-zone/node.
And then we come around, and do the numa_tonode_memory(). And that's
where the issues start, because AFAIK this does not check the per-node
limit of huge pages in any way. It just appears to work. And then later,
when we finally touch the buffer, it tries to actually allocate the
memory on the node, and realizes there's not enough huge pages. And
triggers the SIGBUS.
I think that's why options for strict policy numa allocation exist and
I had the option to use it in my patches (anyway with one big call to
numa_interleave_memory() for everything it was much simpler and
just not micromanaging things). Good reads are numa(3) but e.g.
mbind(2) underneath will tell you that e.g. `Before Linux 5.7.
MPOL_MF_STRICT was ignored on huge page mappings.` (I was on 6.14.x,
but it could be happening for you too if you start using it). Anyway,
numa_set_strict() is just a wrapper around setting this exact flag.
Anyway remember that volatile pg_numa_touch_mem_if_required()? - maybe
that should be always called in your patch series to pre-populate
everything during startup, so that others testing will get proper
guaranteed layout, even without issuing any pg_buffercache calls.
The only way around this I found is by inflating the number of huge
pages, significantly above the shared_memory_size_in_huge_pages value.
Just to make sure the nodes get enough huge pages.

I don't know what to do about this. It's quite annoying. If we only used
huge pages for the partitioned parts, this wouldn't be a problem.
Meh, sacrificing a couple of huge pages (worst-case 1GB ?) just to get
NUMA affinity, seems like a logical trade-off, doesn't it?
But postgres -C shared_memory_size_in_huge_pages still works OK to
establish the exact count for vm.nr_hugepages, right?
Regards,
-J.
On 7/30/25 10:29, Jakub Wartak wrote:
On Mon, Jul 28, 2025 at 4:22 PM Tomas Vondra <tomas@vondra.me> wrote:
Hi Tomas,
just a quick look here:
2) The PGPROC part introduces a similar registry, [..]
There's also a view pg_buffercache_pgproc. The pg_buffercache location
is a bit bogus - it has nothing to do with buffers, but it was good
enough for now.

If you are looking for better names: pg_shmem_pgproc_numa would sound
like a more natural name.

3) The PGPROC partitioning is reworked and should fix the crash with the
GUC set to "off".Thanks!
simple benchmark
----------------

[..]
There's results for the three "pgbench pinning" strategies, and that can
have pretty significant impact (colocated generally performs much better
than either "none" or "random").Hint: real world is that network cards are usually located on some PCI
slot that is assigned to certain node (so traffic is flowing from/to
there), so probably it would make some sense to put pgbench outside
this machine and remove this as "variable" anyway and remove the need
for that pgbench --pin-cpus in script. In optimal conditions: most
optimized layout would be probably to have 2 cards on separate PCI
slots, each for different node and some LACP between those, with
xmit_hash_policy allowing traffic distribution on both of those cards
-- usually there's not just single IP/MAC out there talking to/from
such server, so that would be real-world (or lack of) affinity.
The pgbench pinning certainly reduces some of the noise / overhead you
get when using multiple machines. I use it to "isolate" patches, and
make the effects more visible.
Also, the classic pgbench workload seems to be a poor fit for testing it
out (at least v3-0001 buffers); there I would propose sticking to just
lots of big (~s_b size) full table seq scans to put stress on shared
memory. Classic pgbench is usually not enough to put serious bandwidth
on the interconnect, by my measurements.
Yes, that's possible. The simple pgbench workload is a bit of a "worst
case" for the NUMA patches, in that it's can benefit less from the
improvements, and it's also fairly sensitive to regressions.
I plan to do more tests with other types of workloads, like the one
doing a lot of large sequential scans, etc.
For the "bigger" machine (wiuth 176 cores) the incremental results look
like this (for pinning=none, i.e. regular pgbench):mode s_b buffers localal no-tail freelist sweep pgproc pinning
====================================================================
prepared 16GB 99% 101% 100% 103% 111% 99% 102%
32GB 98% 102% 99% 103% 107% 101% 112%
8GB 97% 102% 100% 102% 101% 101% 106%
--------------------------------------------------------------------
simple 16GB 100% 100% 99% 105% 108% 99% 108%
32GB 98% 101% 100% 103% 100% 101% 97%
8GB 100% 100% 101% 99% 100% 104% 104%The way I read this is that the first three patches have about no impact
on throughput. Then freelist partitioning and (especially) clocksweep
partitioning can help quite a bit. pgproc is again close to ~0%, and
PGPROC pinning can help again (but this part is merely experimental).

Isn't the "pinning" column representing just numa_procs_pin=on ?
(shouldn't it be tested with numa_procs_interleave = on?)
Maybe I don't understand the question, but the last column (pinning)
compares two builds.
1) Build with all the patches up to "pgproc interleaving" (and all of
the GUCs set to "on").
2) Build with all the patches from (1), and "pinning" too (again, all
GUCs set to "on).
Or do I misunderstand the question?
[..]
To quantify this kind of improvement, I think we'll need tests that
intentionally cause (or try to) imbalance. If you have ideas for such
tests, let me know.

Some ideas:
1. concurrent seq scans hitting s_b-sized table
2. one single giant PX-enabled seq scan with $VCPU workers (stresses
the importance of interleaving dynamic shm for workers)
3. select txid_current() with -M prepared?
Thanks. I think we'll try something like (1), but it'll need to be a bit
more elaborate, because scans on tables larger than 1/4 shared buffers
use a small circular buffer.
reserving number of huge pages
------------------------------

[..]
It took me ages to realize what's happening, but it's very simple. The
nr_hugepages is a global limit, but it's also translated into limits for
each NUMA node. So when you write 16828 to it, in a 4-node system each
node gets 1/4 of that. See

$ numastat -cm
Then we do the mmap(), and everything looks great, because there really
is enough huge pages and the system can allocate memory from any NUMA
node it needs.

Yup, similar story as with OOMs, just per-zone/node.
And then we come around, and do the numa_tonode_memory(). And that's
where the issues start, because AFAIK this does not check the per-node
limit of huge pages in any way. It just appears to work. And then later,
when we finally touch the buffer, it tries to actually allocate the
memory on the node, and realizes there's not enough huge pages. And
triggers the SIGBUS.

I think that's why options for strict policy numa allocation exist and
I had the option to use it in my patches (anyway with one big call to
numa_interleave_memory() for everything it was much simpler and
just not micromanaging things). Good reads are numa(3) but e.g.
mbind(2) underneath will tell you that e.g. `Before Linux 5.7.
MPOL_MF_STRICT was ignored on huge page mappings.` (I was on 6.14.x,
but it could be happening for you too if you start using it). Anyway,
numa_set_strict() is just a wrapper around setting this exact flag.

Anyway remember that volatile pg_numa_touch_mem_if_required()? - maybe
that should be always called in your patch series to pre-populate
everything during startup, so that others testing will get proper
guaranteed layout, even without issuing any pg_buffercache calls.
I think I tried using numa_set_strict, but it didn't change the behavior
(i.e. the numa_tonode_memory didn't error out).
The only way around this I found is by inflating the number of huge
pages, significantly above the shared_memory_size_in_huge_pages value.
Just to make sure the nodes get enough huge pages.

I don't know what to do about this. It's quite annoying. If we only used
huge pages for the partitioned parts, this wouldn't be a problem.

Meh, sacrificing a couple of huge pages (worst-case 1GB ?) just to get
NUMA affinity, seems like a logical trade-off, doesn't it?
But postgres -C shared_memory_size_in_huge_pages still works OK to
establish the exact count for vm.nr_hugepages, right?
Well, yes and no. It tells you the exact number of huge pages, but it
does not tell you how much you need to inflate it to account for the
non-shared buffer part that may get allocated on a random node.
regards
--
Tomas Vondra
Hi,
Here's an updated version of the patch series. The main improvement is
the new 0006 patch, adding "adaptive balancing" of allocations. I'll
also share some results from a workload doing a lot of allocations.
adaptive balancing of allocations
---------------------------------
Imagine each backend only allocates buffers from the partition on the
same NUMA node. E.g. you have 4 NUMA nodes (i.e. 4 partitions), and a
backend only allocates buffers from its "home" partition (on the same NUMA
node). This is what the earlier patch versions did, and with many
backends that's mostly fine (assuming the backends get spread over all
the NUMA nodes).
But if there are only a few backends doing the allocations, this can result
in very inefficient use of shared buffers - a single backend would be
limited to 25% of buffers, even if the rest is unused.
There needs to be some way to "redirect" excess allocations to other
partitions, so that the partitions are utilized about the same. This is
what the 0006 patch aims to do (I kept it separate, but it should
probably get merged into the "clocksweep partitioning" in the end).
The balancing is fairly simple:
(1) It tracks the number of allocations "requested" from each partition.
(2) At regular intervals (from bgwriter) calculate the "fair share" per
partition, and determine what fraction of "requests" to handle from the
partition itself, and how many to redirect to other partitions.
(3) Calculate coefficients to drive this for each partition.
I emphasize that (1) counts "requests", not the actual allocations. Some
of the requests may have been redirected to a different partition, and
counted as allocations there. We want to balance the allocations, but
the calculation has to work from the requests.
To give you a simple example - imagine there are 2 partitions with this
number of allocation requests:
P1: 900,000 requests
P2: 100,000 requests
This means the "fair share" is 500,000 allocations, so P1 needs to
redirect some requests to P2. And we end up with these weights:
P1: [ 55, 45]
P2: [ 0, 100]
Assuming the workload does not shift in some dramatic way, this should
result in both partitions handling ~500k allocations.
It's not hard to extend this algorithm to more partitions. For more
details see StrategySyncBalance(), which recalculates this.
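To make the weight calculation a bit more concrete, here's a rough
stand-alone sketch of the idea described above (simplified, not the
actual StrategySyncBalance() code): over-loaded partitions keep their
fair share locally and spread the excess over the under-loaded
partitions in proportion to how much room those have. Running it with
the 900k/100k example above reproduces the weights shown (modulo
integer rounding).

/* Sketch of the redirection-weight calculation (not the patch's code). */
#include <stdio.h>
#include <stdint.h>

#define NPARTS 2

static void
compute_weights(const uint64_t requests[NPARTS], int weights[NPARTS][NPARTS])
{
	uint64_t	total = 0;
	uint64_t	fair;
	uint64_t	deficit_total = 0;

	for (int i = 0; i < NPARTS; i++)
		total += requests[i];
	fair = total / NPARTS;

	/* total "room" available in under-loaded partitions */
	for (int j = 0; j < NPARTS; j++)
		if (requests[j] < fair)
			deficit_total += fair - requests[j];

	for (int i = 0; i < NPARTS; i++)
	{
		uint64_t	excess;

		if (requests[i] <= fair || deficit_total == 0)
		{
			/* under-loaded (or perfectly balanced): keep everything local */
			for (int j = 0; j < NPARTS; j++)
				weights[i][j] = (i == j) ? 100 : 0;
			continue;
		}

		/*
		 * over-loaded: keep the fair share locally, spread the excess in
		 * proportion to how much room each under-loaded partition has
		 */
		excess = requests[i] - fair;

		for (int j = 0; j < NPARTS; j++)
		{
			if (j == i)
				weights[i][j] = (int) (100 * fair / requests[i]);
			else if (requests[j] < fair)
				weights[i][j] = (int) (100 * excess * (fair - requests[j])
									   / (deficit_total * requests[i]));
			else
				weights[i][j] = 0;
		}
	}
}

int
main(void)
{
	uint64_t	requests[NPARTS] = {900000, 100000};
	int			weights[NPARTS][NPARTS];

	compute_weights(requests, weights);

	/* prints roughly P1: [55, 44], P2: [0, 100] (integer rounding) */
	for (int i = 0; i < NPARTS; i++)
		printf("P%d: [%3d, %3d]\n", i + 1, weights[i][0], weights[i][1]);
	return 0;
}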
There are a couple open questions, like:
* The algorithm combines the old/new weights by averaging, to add a bit
of hysteresis. Right now it's a simple average with 0.5 weight, to
dampen sudden changes (see the small sketch after this list). I think it
works fine (in the long run), but I'm open to suggestions on how to do
this better.
* There are probably additional things we should consider when deciding
where to redirect the allocations. For example, we may have multiple
partitions per NUMA node, in which case it's better to keep as many of
the redirected allocations as possible on the same node. The current
patch ignores this.
* The partitions may have slightly different sizes, but the balancing
ignores that for now. This is not very difficult to address.
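For reference, the smoothing from the first point above is essentially
this (a sketch of the idea, not the exact code from the patch):

/*
 * Blend the freshly computed weight with the previous one using a fixed
 * 0.5 factor, so one interval with an unusual request pattern doesn't
 * swing the redirection too hard.
 */
static inline double
smooth_weight(double old_weight, double new_weight)
{
	return 0.5 * old_weight + 0.5 * new_weight;
}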
clocksweep benchmark
--------------------
I ran a simple benchmark focused on allocation-heavy workloads, namely
large concurrent sequential scans. The attached scripts generate a
number of 1GB tables, and then run concurrent sequential scans with
shared buffers set to 60%, 75%, 90% and 110% of the total dataset size.
I did this for master, and with the NUMA patches applied (and the GUCs
set to 'on'). I also tried with the number of partitions increased to
16 (so each NUMA node got multiple partitions).
There are results from three machines:
1) ryzen - small non-NUMA system, mostly to see if there's regressions
2) xeon - older 2-node NUMA system
3) hb176 - big EPYC system with 176 cores / 4 NUMA nodes
The script records detailed TPS stats (e.g. percentiles), I'm attaching
CSV files with complete results, and some PDFs with charts summarizing
that (I'll get to that in a minute).
For the EPYC, the average tps for the three builds looks like this:
  clients |  master    numa   numa-16  |    numa   numa-16
          |         (average tps)      |       (vs master)
----------|----------------------------|-------------------
        8 |      20      27        26  |    133%      129%
       16 |      23      39        45  |    170%      193%
       24 |      23      48        58  |    211%      252%
       32 |      21      57        68  |    268%      321%
       40 |      21      56        76  |    265%      363%
       48 |      22      59        82  |    270%      375%
       56 |      22      66        88  |    296%      397%
       64 |      23      62        93  |    277%      411%
       72 |      24      68        95  |    277%      389%
       80 |      24      72        95  |    295%      391%
       88 |      25      71        98  |    283%      392%
       96 |      26      74        97  |    282%      369%
      104 |      26      74        97  |    282%      367%
      112 |      27      77        95  |    287%      355%
      120 |      28      77        92  |    279%      335%
      128 |      27      75        89  |    277%      328%
That's not bad - the clocksweep partitioning increases the throughput
2-3x. Having 16 partitions (instead of 4) helps a bit more still, up to
3-4x.
This is for shared buffers set to 60% of the dataset, whose size depends
on the number of clients / tables. With 64 clients/tables, there's 64GB of
data, and shared buffers are set to ~39GB.
The results for 75% and 90% follow the same pattern. For 110% there's
much less impact - the data fits into shared buffers, so there are
(almost) no allocations, and any gains have to be thanks to the other
NUMA patches.
The charts in the attached PDFs add a bit more detail, with various
percentiles (of per-second throughput). The bands are roughly quartiles:
5-25%, 25-50%, 50-75%, 75-95%. The thick middle line is the median.
There are only charts for 60%, 90% and 110% shared buffers, to fit it
all on a single page. The 75% case is not very different.
For ryzen there's little difference. Not surprising, it's not a NUMA
system. So this is a positive result, as there's no regression.
For xeon the patches help a little bit. Again, not surprising. It's a
fairly old system (~2016), and the differences between NUMA nodes are
not that significant.
For epyc (hb176), the differences are pretty massive.
regards
--
Tomas Vondra
Attachments:
numa-benchmark-ryzen.pdf (application/pdf)