Adding basic NUMA awareness
Hi,
This is a WIP version of a patch series I'm working on, adding some
basic NUMA awareness for a couple parts of our shared memory (shared
buffers, etc.). It's based on Andres' experimental patches he spoke
about at pgconf.eu 2024 [1], and while it's improved and polished in
various ways, it's still experimental.
But there's a recent thread aiming to do something similar [2], so
better to share it now so that we can discuss both approaches. This
patch set is a bit more ambitious, handling NUMA in a way to allow
smarter optimizations later, so I'm posting it in a separate thread.
The series is split into patches addressing different parts of the
shared memory, starting (unsurprisingly) from shared buffers, then
buffer freelists and ProcArray. There are a couple of additional parts, but
those are smaller, addressing miscellaneous stuff.
Each patch has a numa_ GUC, intended to enable/disable that part. This
is meant to make development easier, not as a final interface. I'm not
sure how exactly that should look. It's possible some combinations of
GUCs won't work, etc.
Each patch should have a commit message explaining the intent and
implementation, and then also detailed comments explaining various
challenges and open questions.
But let me go over the basics, and discuss some of the design choices
and open questions that need solving.
1) v1-0001-NUMA-interleaving-buffers.patch
This is the main thing when people think about NUMA - making sure the
shared buffers are allocated evenly on all the nodes, not just on a
single node (which can happen easily with warmup). The regular memory
interleaving would address this, but it also has some disadvantages.
Firstly, it's oblivious to the contents of the shared memory segment,
and we may not want to interleave everything. It's also oblivious to the
alignment of the items (a buffer can easily end up "split" across multiple
NUMA nodes), and to the relationship between different parts (e.g. there's a
BufferBlock and a related BufferDescriptor, and those might again end up
on different nodes).
So the patch handles this by explicitly mapping chunks of shared buffers
to different nodes - a bit like interleaving, but in larger chunks.
Ideally each node gets (1/N) of shared buffers, as a contiguous chunk.
It's a bit more complicated, because the patch distributes both the
blocks and descriptors, in the same way. So a buffer and its descriptor
always end up on the same NUMA node. This is one of the reasons why we need
to map larger chunks, because NUMA works on page granularity, and the
descriptors are tiny - many fit on a memory page.
There's a secondary benefit of explicitly assigning buffers to nodes
using this simple scheme - it allows quickly determining the node ID
given a buffer ID. This is helpful later, when building the freelists.
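To illustrate the lookup this enables, here's a minimal sketch; the function name and the exact chunking scheme are made up for illustration, not the patch's actual code:

```c
#include <assert.h>

/*
 * Hypothetical sketch: with shared buffers split into num_nodes contiguous
 * chunks, the NUMA node of a buffer follows from simple integer arithmetic,
 * with no lookup table needed.
 */
static int
buffer_to_node(int buf_id, int nbuffers, int num_nodes)
{
	int	chunk = (nbuffers + num_nodes - 1) / num_nodes;	/* buffers per node, rounded up */

	return buf_id / chunk;
}
```

With OS-level interleaving there would be no such cheap mapping, which is exactly why the explicit assignment helps when partitioning the freelists later.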
The patch is fairly simple. Most of the complexity is about picking the
chunk size, and aligning the arrays (so that it nicely aligns with
memory pages).
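For illustration, the chunk-size choice can be sketched as finding the smallest number of buffers for which both the block array and the descriptor array line up with the page size. The constants in the test below are assumptions (8KB blocks, 64-byte descriptors, 2MB huge pages) and the helper names are invented:

```c
#include <assert.h>

static long gcd_l(long a, long b) { while (b) { long t = a % b; a = b; b = t; } return a; }
static long lcm_l(long a, long b) { return a / gcd_l(a, b) * b; }

/*
 * Smallest chunk size (in buffers) such that both the chunk's block array
 * and its descriptor array are whole multiples of the memory page size, so
 * a chunk never shares a page with a chunk assigned to another node.
 */
static long
chunk_buffers(long blksz, long descsz, long pagesz)
{
	long	blocks_chunk = lcm_l(pagesz, blksz) / blksz;
	long	descs_chunk = lcm_l(pagesz, descsz) / descsz;

	return lcm_l(blocks_chunk, descs_chunk);
}
```

Note how the tiny descriptors dominate: with 2MB huge pages the chunk has to be large enough that even the descriptor array fills whole pages.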
The patch has a GUC "numa_buffers_interleave", with "off" by default.
2) v1-0002-NUMA-localalloc.patch
This simply sets "localalloc" when initializing a backend, so that all
memory allocated later is local, not interleaved. Initially this was
necessary because the patch set the allocation policy to interleaving
before initializing shared memory, and we didn't want to interleave the
private memory. But that's no longer the case - the explicit mapping to
nodes does not have this issue. I'm keeping the patch for convenience;
it allows experimenting with numactl etc.
The patch has a GUC "numa_localalloc", with "off" by default.
3) v1-0003-freelist-Don-t-track-tail-of-a-freelist.patch
Minor optimization. Andres noticed we're tracking the tail of buffer
freelist, without using it. So the patch removes that.
4) v1-0004-NUMA-partition-buffer-freelist.patch
Right now we have a single freelist, and in busy instances that can be
quite contended. What's worse, the freelist may thrash between different
CPUs, NUMA nodes, etc. So the idea is to have multiple freelists on
subsets of buffers. The patch implements multiple strategies how the
list can be split (configured using "numa_partition_freelist" GUC), for
experimenting:
* node - One list per NUMA node. This is the most natural option,
because we now know which buffer is on which node, so we can ensure a
list for a node only has buffers from that node.
* cpu - One list per CPU. Pretty simple, each CPU gets its own list.
* pid - Similar to "cpu", but the processes are mapped to lists based on
PID, not CPU ID.
* none - no partitioning, a single freelist
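A rough sketch of how a backend might pick its list under each strategy; the enum and the modulo mapping are illustrative, not the patch's actual code:

```c
#include <assert.h>

typedef enum
{
	FREELIST_NONE,				/* single shared freelist */
	FREELIST_NODE,				/* one list per NUMA node */
	FREELIST_CPU,				/* one list per CPU */
	FREELIST_PID				/* mapped by process PID */
} FreelistStrategy;

/* Pick the freelist index for the current backend (toy version). */
static int
pick_freelist(FreelistStrategy mode, int my_node, int my_cpu, int my_pid, int nlists)
{
	switch (mode)
	{
		case FREELIST_NODE:
			return my_node;		/* node ID is a valid list index directly */
		case FREELIST_CPU:
			return my_cpu % nlists;
		case FREELIST_PID:
			return my_pid % nlists;
		default:
			return 0;			/* FREELIST_NONE: everyone shares list 0 */
	}
}
```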
Ultimately, I think we'll want to go with "node", simply because it
aligns with the buffer interleaving. But there are improvements needed.
The main challenge is that with multiple smaller lists, a process can't
really use the whole shared buffers. So a single backend will only use
part of the memory. The more lists there are, the worse this effect is.
This is also why I think we won't use the other partitioning options,
because there's going to be more CPUs than NUMA nodes.
Obviously, this needs solving even with NUMA nodes - we need to allow a
single backend to utilize the whole shared buffers if needed. There
should be a way to "steal" buffers from other freelists (if the
"regular" freelist is empty), but the patch does not implement this.
Shouldn't be hard, I think.
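The "stealing" fallback could look roughly like this toy version (single-element lists, no locking; a real implementation would need per-list locks or atomics, and would fall back to the clocksweep at the end):

```c
#include <assert.h>

#define NO_BUFFER (-1)

/*
 * Toy sketch: try the backend's own freelist first; if it's empty, scan the
 * other lists round-robin and "steal" from the first non-empty one.
 * heads[i] holds the one free buffer of list i, or NO_BUFFER if empty.
 */
static int
get_free_buffer(int *heads, int nlists, int mylist)
{
	for (int i = 0; i < nlists; i++)
	{
		int		l = (mylist + i) % nlists;

		if (heads[l] != NO_BUFFER)
		{
			int		buf = heads[l];

			heads[l] = NO_BUFFER;	/* pop (toy: one buffer per list) */
			return buf;
		}
	}
	return NO_BUFFER;			/* all lists empty: caller runs clocksweep */
}
```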
The other missing part is the clocksweep - there's still just a single
instance of the clocksweep, feeding buffers to all the freelists. But that's
clearly a problem, because the clocksweep returns buffers from all NUMA
nodes. The clocksweep really needs to be partitioned the same way as the
freelists, with each partition operating on a subset of buffers (from
the right NUMA node).
I do have a separate experimental patch doing something like that, I
need to make it part of this branch.
5) v1-0005-NUMA-interleave-PGPROC-entries.patch
Another area that seems like it might benefit from NUMA is PGPROC, so I
gave it a try. It turned out somewhat challenging. Similarly to buffers
we have two pieces that need to be located in a coordinated way - PGPROC
entries and fast-path arrays. But we can't use the same approach as for
buffers/descriptors, because
(a) Neither of those pieces aligns with memory page size (PGPROC is
~900B, fast-path arrays are variable length).
(b) We could pad PGPROC entries e.g. to 1KB, but that'd still require a
rather high max_connections before we'd use multiple huge pages.
The fast-path arrays are less of a problem, because those tend to be
larger, and are accessed through pointers, so we can just adjust that.
So what I did instead is splitting the whole PGPROC array into one array
per NUMA node, and one array for auxiliary processes and 2PC xacts. So
with 4 NUMA nodes there are 5 separate arrays, for example. Each array
is a multiple of memory pages, so we may waste some of the memory. But
that's simply how NUMA works - page granularity.
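The per-array padding is just the usual round-up-to-page-multiple arithmetic; the sizes in the test are illustrative (~900B PGPROC, 4KB pages), and the helper name is made up:

```c
#include <assert.h>
#include <stddef.h>

/*
 * Size of one per-node PGPROC array, rounded up to a whole number of
 * memory pages. The tail of the last page is the wasted memory mentioned
 * above - the price of NUMA's page granularity.
 */
static size_t
pgproc_array_bytes(size_t nentries, size_t entry_size, size_t page_size)
{
	size_t	raw = nentries * entry_size;

	return (raw + page_size - 1) / page_size * page_size;
}
```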
This however makes one particular thing harder - in a couple of places we
accessed PGPROC entries through PROC_HDR->allProcs, which was pretty
much just one large array. And GetNumberFromPGProc() relied on array
arithmetic to determine the procnumber. With the array partitioned, this
can't work the same way.
But there's a simple solution - if we turn allProcs into an array of
*pointers* to PGPROC arrays, there's no issue. All the places need a
pointer anyway. And then we need an explicit procnumber field in PGPROC,
instead of calculating it.
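A minimal sketch of that change; the struct is stripped down to the one relevant field and the builder function is invented for illustration (in the patch this would involve PROC_HDR->allProcs and the real PGPROC):

```c
#include <assert.h>
#include <stdlib.h>

typedef struct PGPROC
{
	int		procnumber;			/* explicit field, replaces array arithmetic */
} PGPROC;

/*
 * Instead of one big PGPROC array (where procnumber = proc - allProcs),
 * allProcs becomes an array of *pointers* into the per-node arrays, and
 * each entry carries its own procnumber, assigned once at initialization.
 */
static PGPROC **
build_allprocs(PGPROC **node_arrays, int *node_counts, int nnodes, int *total)
{
	int		n = 0;

	for (int i = 0; i < nnodes; i++)
		n += node_counts[i];

	PGPROC **allProcs = malloc(sizeof(PGPROC *) * n);
	int		k = 0;

	for (int i = 0; i < nnodes; i++)
		for (int j = 0; j < node_counts[i]; j++)
		{
			allProcs[k] = &node_arrays[i][j];
			allProcs[k]->procnumber = k;
			k++;
		}
	*total = n;
	return allProcs;
}
```

Every caller that previously indexed allProcs[i] still works; only the subtraction-based GetNumberFromPGProc() has to switch to reading the stored field.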
There's a chance this has a negative impact on code that accesses PGPROC
very often, but so far I haven't seen such cases. But if you can come up
with such examples, I'd like to see them.
There's another detail - when obtaining a PGPROC entry in InitProcess(),
we try to get an entry from the same NUMA node. And only if that doesn't
work, we grab the first one from the list (there's still just one PGPROC
freelist, I haven't split that - maybe we should?).
This has a GUC "numa_procs_interleave", again "off" by default. It's not
quite correct, though, because the partitioning always happens. The GUC only
affects the PGPROC lookup. (In a way, this may be a bit broken.)
6) v1-0006-NUMA-pin-backends-to-NUMA-nodes.patch
This is an experimental patch, that simply pins the new process to the
NUMA node obtained from the freelist.
Driven by GUC "numa_procs_pin" (default: off).
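For context, pinning a process to a node essentially means restricting its CPU affinity to that node's CPUs (libnuma's numa_run_on_node() does this with the node's full CPU mask). A toy stand-in using plain sched_setaffinity, pinning to a single CPU, purely for illustration:

```c
#define _GNU_SOURCE
#include <assert.h>
#include <sched.h>
#include <unistd.h>

/*
 * Restrict the current process to one CPU. numa_run_on_node() does the
 * equivalent with the full CPU mask of the chosen NUMA node.
 */
static int
pin_to_cpu(int cpu)
{
	cpu_set_t	mask;

	CPU_ZERO(&mask);
	CPU_SET(cpu, &mask);
	return sched_setaffinity(0, sizeof(mask), &mask);
}
```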
Summary
-------
So this is what I have at the moment. I've tried to organize the patches
in the order of importance, but that's just my guess. It's entirely
possible there's something I missed, some other order might make more
sense, etc.
There's also the question how this is related to other patches affecting
shared memory - I think the most relevant one is the "shared buffers
online resize" by Ashutosh, simply because it touches the shared memory.
I think the splitting would actually make some things simpler, or
maybe more flexible - in particular, it'd allow us to enable huge pages
only for some regions (like shared buffers), and keep the small pages
e.g. for PGPROC. So that'd be good.
But there'd also need to be some logic to "rework" how shared buffers
get mapped to NUMA nodes after resizing. It'd be silly to start with
memory on 4 nodes (25% each), resize shared buffers to 50% and end up
with memory only on 2 of the nodes (because the other 2 nodes were
originally assigned the upper half of shared buffers).
I don't have a clear idea how this would be done, but I guess it'd
require a bit of code invoked sometime after the resize. It'd already
need to rebuild the freelists in some way, I guess.
The other thing I haven't thought about very much is determining on
which CPUs/nodes the instance is allowed to run. I assume we'd start by
simply inheriting/determining that at startup through libnuma, not through
some custom PG configuration (which the patch [2] proposed to do).
regards
[1]: https://www.youtube.com/watch?v=V75KpACdl6E
[2]: /messages/by-id/CAKZiRmw6i1W1AwXxa-Asrn8wrVcVH3TO715g_MCoowTS9rkGyw@mail.gmail.com
--
Tomas Vondra
Attachments:
* v1-0001-NUMA-interleaving-buffers.patch (+427 -42)
* v1-0002-NUMA-localalloc.patch (+28 -1)
* v1-0003-freelist-Don-t-track-tail-of-a-freelist.patch (+0 -10)
* v1-0004-NUMA-partition-buffer-freelist.patch (+327 -30)
* v1-0005-NUMA-interleave-PGPROC-entries.patch (+407 -63)
* v1-0006-NUMA-pin-backends-to-NUMA-nodes.patch (+33 -1)
On Wed, Jul 2, 2025 at 12:37 AM Tomas Vondra <tomas@vondra.me> wrote:
3) v1-0003-freelist-Don-t-track-tail-of-a-freelist.patch
Minor optimization. Andres noticed we're tracking the tail of buffer
freelist, without using it. So the patch removes that.
The patches for resizing buffers use lastFreeBuffer to add new
buffers to the end of the free list when expanding it. But we could just as
well add them at the beginning of the free list.
This patch seems almost independent of the rest of the patches. Do you
need it in the rest of the patches? I understand that those patches
don't need to worry about maintaining lastFreeBuffer after this patch.
Is there any other effect?
If we are going to do this, let's do it earlier so that buffer
resizing patches can be adjusted.
There's also the question how this is related to other patches affecting
shared memory - I think the most relevant one is the "shared buffers
online resize" by Ashutosh, simply because it touches the shared memory.
I have added Dmitry to this thread since he has written most of the
shared memory handling code.
I don't think the splitting would actually make some things simpler, or
maybe more flexible - in particular, it'd allow us to enable huge pages
only for some regions (like shared buffers), and keep the small pages
e.g. for PGPROC. So that'd be good.
The resizing patches split the shared buffer related structures into
separate memory segments. I think that itself will help enabling huge
pages for some regions. Would that help in your case?
But there'd also need to be some logic to "rework" how shared buffers
get mapped to NUMA nodes after resizing. It'd be silly to start with
memory on 4 nodes (25% each), resize shared buffers to 50% and end up
with memory only on 2 of the nodes (because the other 2 nodes were
originally assigned the upper half of shared buffers).
I don't have a clear idea how this would be done, but I guess it'd
require a bit of code invoked sometime after the resize. It'd already
need to rebuild the freelists in some way, I guess.
Yes, there's code to build the free list. I think we will need code to
remap the buffers and buffer descriptor.
--
Best Wishes,
Ashutosh Bapat
On 7/2/25 13:37, Ashutosh Bapat wrote:
On Wed, Jul 2, 2025 at 12:37 AM Tomas Vondra <tomas@vondra.me> wrote:
3) v1-0003-freelist-Don-t-track-tail-of-a-freelist.patch
Minor optimization. Andres noticed we're tracking the tail of buffer
freelist, without using it. So the patch removes that.
The patches for resizing buffers use the lastFreeBuffer to add new
buffers to the end of free list when expanding it. But we could as
well add it at the beginning of the free list.
This patch seems almost independent of the rest of the patches. Do you
need it in the rest of the patches? I understand that those patches
don't need to worry about maintaining lastFreeBuffer after this patch.
Is there any other effect?
If we are going to do this, let's do it earlier so that buffer
resizing patches can be adjusted.
My patches don't particularly rely on this bit, it would work even with
lastFreeBuffer. I believe Andres simply noticed the current code does
not use lastFreeBuffer, it just maintains it, so he removed that as an
optimization. I don't know how significant the improvement is, but if
it's measurable we could just do that independently of our patches.
There's also the question how this is related to other patches affecting
shared memory - I think the most relevant one is the "shared buffers
online resize" by Ashutosh, simply because it touches the shared memory.
I have added Dmitry to this thread since he has written most of the
shared memory handling code.
Thanks.
I don't think the splitting would actually make some things simpler, or
maybe more flexible - in particular, it'd allow us to enable huge pages
only for some regions (like shared buffers), and keep the small pages
e.g. for PGPROC. So that'd be good.
The resizing patches split the shared buffer related structures into
separate memory segments. I think that itself will help enabling huge
pages for some regions. Would that help in your case?
Indirectly. My patch can work just fine with a single segment, but being
able to enable huge pages only for some of the segments seems better.
But there'd also need to be some logic to "rework" how shared buffers
get mapped to NUMA nodes after resizing. It'd be silly to start with
memory on 4 nodes (25% each), resize shared buffers to 50% and end up
with memory only on 2 of the nodes (because the other 2 nodes were
originally assigned the upper half of shared buffers).
I don't have a clear idea how this would be done, but I guess it'd
require a bit of code invoked sometime after the resize. It'd already
need to rebuild the freelists in some way, I guess.
Yes, there's code to build the free list. I think we will need code to
remap the buffers and buffer descriptor.
Right. The good thing is that's just "advisory" information, it doesn't
break anything if it's temporarily out of sync. We don't need to "stop"
everything to remap the buffers to other nodes, or anything like that.
Or at least I think so.
It's one thing to "flip" the target mapping (determining which node a
buffer should be on), and another to actually migrate the buffers. The first
part can be done instantaneously, the second part can happen in the
background over a longer time period.
I'm not sure how you're rebuilding the freelist. Presumably it can
contain buffers that are no longer valid (after shrinking). How is that
handled to not break anything? I think the NUMA variant would do exactly
the same thing, except that there's multiple lists.
regards
--
Tomas Vondra
On Wed, Jul 2, 2025 at 6:06 PM Tomas Vondra <tomas@vondra.me> wrote:
I'm not sure how you're rebuilding the freelist. Presumably it can
contain buffers that are no longer valid (after shrinking). How is that
handled to not break anything? I think the NUMA variant would do exactly
the same thing, except that there's multiple lists.
Before shrinking the buffers, we walk the free list removing any
buffers that are going to be removed. When expanding, we link the
new buffers in order and then add those to the already existing
free list. The 0005 patch in [1] has the code for the same.
[1]: /messages/by-id/my4hukmejato53ef465ev7lk3sqiqvneh7436rz64wmtc7rbfj@hmuxsf2ngov2
--
Best Wishes,
Ashutosh Bapat
On Wed, Jul 02, 2025 at 05:07:28PM +0530, Ashutosh Bapat wrote:
There's also the question how this is related to other patches affecting
shared memory - I think the most relevant one is the "shared buffers
online resize" by Ashutosh, simply because it touches the shared memory.
I have added Dmitry to this thread since he has written most of the
shared memory handling code.
Thanks! I like the idea behind this patch series. I haven't read it in
details yet, but I can imagine both patches (interleaving and online
resizing) could benefit from each other. In online resizing we've
introduced a possibility to use multiple shared mappings for different
types of data, maybe it would be convenient to use the same interface to
create separate mappings for different NUMA nodes as well. Using a
separate shared mapping per NUMA node would also make resizing easier,
since it would be more straightforward to fit an increased segment into
NUMA boundaries.
I don't think the splitting would actually make some things simpler, or
maybe more flexible - in particular, it'd allow us to enable huge pages
only for some regions (like shared buffers), and keep the small pages
e.g. for PGPROC. So that'd be good.
The resizing patches split the shared buffer related structures into
separate memory segments. I think that itself will help enabling huge
pages for some regions. Would that help in your case?
Right, separate segments would allow mixing and matching huge pages with
pages of regular size. It's not implemented in the latest version of the
online resizing patch, purely to reduce complexity and maintain the same
invariant (everything either uses huge pages or nothing does) -- but we
could do it the other way around as well.
On Tue, Jul 1, 2025 at 9:07 PM Tomas Vondra <tomas@vondra.me> wrote:
Hi!
1) v1-0001-NUMA-interleaving-buffers.patch
[..]
It's a bit more complicated, because the patch distributes both the
blocks and descriptors, in the same way. So a buffer and it's descriptor
always end on the same NUMA node. This is one of the reasons why we need
to map larger chunks, because NUMA works on page granularity, and the
descriptors are tiny - many fit on a memory page.
Oh, now I get it! OK, let's stick to this one.
I don't think the splitting would actually make some things simpler, or
maybe more flexible - in particular, it'd allow us to enable huge pages
only for some regions (like shared buffers), and keep the small pages
e.g. for PGPROC. So that'd be good.
You have made the assumption that this is good, but small pages (4KB) are
not hugetlb, and are *swappable* (Transparent HPs are swappable too;
manually allocated ones, as with mmap(MAP_HUGETLB), are not) [1]. The
most frequent problem I see these days is OOMs, and it makes me
believe that making certain critical parts of shared memory
swappable just to get page-size granularity is possibly throwing the baby
out with the bathwater. I'm thinking about bad situations like: some
wrong settings of vm.swappiness that people keep (or distros keep?) and
the general inability of PG to refrain from allocating more memory in
some cases.
The other thing I haven't thought about very much is determining on
which CPUs/nodes the instance is allowed to run. I assume we'd start by
simply inherit/determine that at the start through libnuma, not through
some custom PG configuration (which the patch [2] proposed to do).
0. I think that we could do better, some counter arguments to
no-configuration-at-all:
a. as Robert & Bertrand already put it there after review: let's say I
want to run just on NUMA node #2, so here I would need to override
systemd's ExecStart= to include that numactl (not elegant?). I
could also use `CPUAffinity=1,3,5,7..` but that's all, and it is even
less friendly. Also it probably requires root to edit/reload systemd,
while having a GUC for this like in my proposal makes it smoother (I
think?)
b. wouldn't it be better if that stayed as drop-in rather than always
on? What if there's a problem, how do you disable those internal
optimizations if they do harm in some cases? (or let's say I want to
play with MPOL_INTERLEAVE_WEIGHTED?). So at least boolean
numa_buffers_interleave would be nice?
c. What if I want my standby (walreceiver+startup/recovery) to run
with NUMA affinity to get better performance (I'm not going to hack
around systemd script every time, but I could imagine changing
numa=X,Y,Z after restart/before promotion)
d. Now if I were forced for some reason to do that numactl(1)
voodoo, and use those above-mentioned overrides, and PG wouldn't be
having a GUC (let's say I would use `numactl
--weighted-interleave=0,1`), then:
2) v1-0002-NUMA-localalloc.patch
This simply sets "localalloc" when initializing a backend, so that all
memory allocated later is local, not interleaved. Initially this was
necessary because the patch set the allocation policy to interleaving
before initializing shared memory, and we didn't want to interleave the
private memory. But that's no longer the case - the explicit mapping to
nodes does not have this issue. I'm keeping the patch for convenience,
it allows experimenting with numactl etc.
... is not accurate anymore, and we would require to have that in
(still with a GUC)?
Thoughts? I can add that mine part into Your's patches if you want.
Way too quick review and some very fast benchmark probes, I've
concentrated only on v1-0001 and v1-0005 (efficiency of buffermgmt
would be too new topic for me), but let's start:
1. normal pgbench -S (still with just s_b@4GB), done many tries,
consistent benefit for the patch with like +8..10% boost on generic
run:
numa_buffers_interleave=off numa_pgproc_interleave=on (due to that
always-on "if"), s_b just on 1 NUMA node (might happen)
latency average = 0.373 ms
latency stddev = 0.237 ms
initial connection time = 45.899 ms
tps = 160242.147877 (without initial connection time)
numa_buffers_interleave=on numa_pgproc_interleave=on
latency average = 0.345 ms
latency stddev = 0.373 ms
initial connection time = 44.485 ms
tps = 177564.686094 (without initial connection time)
2. Tested it the same way as I did for mine (problem #2 from Andres's
presentation): 4s32c128t, s_b=4GB (on 128GB), prewarm test (with
seqconcurrscans.pgb as earlier)
default/numa_buffers_interleave=off
latency average = 1375.478 ms
latency stddev = 1141.423 ms
initial connection time = 46.104 ms
tps = 45.868075 (without initial connection time)
numa_buffers_interleave=on
latency average = 838.128 ms
latency stddev = 498.787 ms
initial connection time = 43.437 ms
tps = 75.413894 (without initial connection time)
and I've repeated the same test (identical conditions) with my
patch, which got me slightly more juice:
latency average = 727.717 ms
latency stddev = 410.767 ms
initial connection time = 45.119 ms
tps = 86.844161 (without initial connection time)
(but mine didn't get that boost from normal pgbench as per #1
pgbench -S -- my numa='all' stays @ 160k TPS just as
numa_buffers_interleave=off), so this idea is clearly better.
So should I close https://commitfest.postgresql.org/patch/5703/
and you'll open a new one or should I just edit the #5703 and alter it
and add this thread too?
3. The patch is not calling interleave on PQ shmem, do we want to add that
in as some next item like v1-0007? The question is whether OS interleaving
makes sense there? I believe it does, please see my thread
(NUMA_pq_cpu_pinning_results.txt); the issue is that PQ workers are
being spawned by postmaster and may end up on different NUMA nodes
randomly, so actually OS-interleaving that memory reduces jitter there
(AKA bandwidth-over-latency). My thinking is that one cannot expect a
static/forced CPU-to-just-one-NUMA-node assignment for a backend and
its PQ workers, because it is impossible to always have CPU power
available in that NUMA node, so it might be useful to interleave
that shared mem there too (as a separate patch item?)
4. In BufferManagerShmemInit() you call numa_num_configured_nodes()
(also in v1-0005). My worry is whether we should put some
known-limitations docs (?) in from the start and mention that
if the VM is greatly resized and new NUMA nodes appear, they might
not be used until restart?
5. In v1-0001, pg_numa_interleave_memory()
+ * XXX no return value, to make this fail on error, has to use
+ * numa_set_strict
Yes, my patch has those numa_error() and numa_warn() handlers too in
pg_numa. Feel free to use it for better UX.
+ * XXX Should we still touch the memory first, like
with numa_move_pages,
+ * or is that not necessary?
It's not necessary to touch after numa_tonode_memory() (wrapper around
numa_interleave_memory()), if it is going to be used anyway it will be
correctly placed to best of my knowledge.
6. diff --git a/src/bin/pgbench/pgbench.c b/src/bin/pgbench/pgbench.c
Accidental indents (also fails to apply)
7. We miss the pg_numa_* shims, but for sure that's for later and also
avoid those Linux specific #ifdef USE_LIBNUMA and so on?
8. v1-0005 2x + /* if (numa_procs_interleave) */
Ha! It's a TRAP! I've uncommented it because I wanted to try it out
without it (just by setting the GUC off), but "MyProc->sema" is NULL:
2025-07-04 12:31:08.103 CEST [28754] LOG: starting PostgreSQL
19devel on x86_64-linux, compiled by gcc-12.2.0, 64-bit
[..]
2025-07-04 12:31:08.109 CEST [28754] LOG: io worker (PID 28755)
was terminated by signal 11: Segmentation fault
2025-07-04 12:31:08.109 CEST [28754] LOG: terminating any other
active server processes
2025-07-04 12:31:08.114 CEST [28754] LOG: shutting down because
"restart_after_crash" is off
2025-07-04 12:31:08.116 CEST [28754] LOG: database system is shut down
[New LWP 28755]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
Core was generated by `postgres: io worker '.
Program terminated with signal SIGSEGV, Segmentation fault.
#0 __new_sem_wait_fast (definitive_result=1, sem=sem@entry=0x0)
at ./nptl/sem_waitcommon.c:136
136 ./nptl/sem_waitcommon.c: No such file or directory.
(gdb) where
#0 __new_sem_wait_fast (definitive_result=1, sem=sem@entry=0x0)
at ./nptl/sem_waitcommon.c:136
#1 __new_sem_trywait (sem=sem@entry=0x0) at ./nptl/sem_wait.c:81
#2 0x00005561918e0cac in PGSemaphoreReset (sema=0x0) at
../src/backend/port/posix_sema.c:302
#3 0x0000556191970553 in InitAuxiliaryProcess () at
../src/backend/storage/lmgr/proc.c:992
#4 0x00005561918e51a2 in AuxiliaryProcessMainCommon () at
../src/backend/postmaster/auxprocess.c:65
#5 0x0000556191940676 in IoWorkerMain (startup_data=<optimized
out>, startup_data_len=<optimized out>) at
../src/backend/storage/aio/method_worker.c:393
#6 0x00005561918e8163 in postmaster_child_launch
(child_type=child_type@entry=B_IO_WORKER, child_slot=20086,
startup_data=startup_data@entry=0x0,
startup_data_len=startup_data_len@entry=0,
client_sock=client_sock@entry=0x0) at
../src/backend/postmaster/launch_backend.c:290
#7 0x00005561918ea09a in StartChildProcess
(type=type@entry=B_IO_WORKER) at
../src/backend/postmaster/postmaster.c:3973
#8 0x00005561918ea308 in maybe_adjust_io_workers () at
../src/backend/postmaster/postmaster.c:4404
[..]
(gdb) print *MyProc->sem
Cannot access memory at address 0x0
9. v1-0006: is this just a thought or a serious candidate? I can imagine
it could easily blow up with some backends somehow requesting CPUs only
from one NUMA node, while the second node stays idle. Isn't it better
just to leave CPU scheduling, well, to the CPU scheduler? The problem
is that you have tools showing overall CPU usage, even mpstat(1) per
CPU, but no tools for per-NUMA-node CPU util%, so it would be hard
for someone to realize that this is happening.
-J.
[1]: https://www.kernel.org/doc/Documentation/vm/hugetlbpage.txt
On 7/4/25 13:05, Jakub Wartak wrote:
On Tue, Jul 1, 2025 at 9:07 PM Tomas Vondra <tomas@vondra.me> wrote:
Hi!
1) v1-0001-NUMA-interleaving-buffers.patch
[..]
It's a bit more complicated, because the patch distributes both the
blocks and descriptors, in the same way. So a buffer and it's descriptor
always end on the same NUMA node. This is one of the reasons why we need
to map larger chunks, because NUMA works on page granularity, and the
descriptors are tiny - many fit on a memory page.
Oh, now I get it! OK, let's stick to this one.
I don't think the splitting would actually make some things simpler, or
maybe more flexible - in particular, it'd allow us to enable huge pages
only for some regions (like shared buffers), and keep the small pages
e.g. for PGPROC. So that'd be good.
You have made assumption that this is good, but small pages (4KB) are
not hugetlb, and are *swappable* (Transparent HP are swappable too,
manually allocated ones as with mmap(MMAP_HUGETLB) are not)[1]. The
most frequent problem I see these days are OOMs, and it makes me
believe that making certain critical parts of shared memory being
swappable just to make pagesize granular is possibly throwing the baby
out with the bathwater. I'm thinking about bad situations like: some
wrong settings of vm.swapiness that people keep (or distros keep?) and
general inability of PG to restrain from allocating more memory in
some cases.
I haven't observed such issues myself, or maybe I didn't realize it's
happening. Maybe it happens, but it'd be good to see some data showing
that, or a reproducer of some sort. But let's say it's real.
I don't think we should use huge pages merely to ensure something is not
swapped out. The "not swappable" is more of a limitation of huge pages,
not an advantage. You can't just choose to make them swappable.
Wouldn't it be better to keep using 4KB pages, but lock the memory using
mlock/mlockall?
The other thing I haven't thought about very much is determining on
which CPUs/nodes the instance is allowed to run. I assume we'd start by
simply inherit/determine that at the start through libnuma, not through
some custom PG configuration (which the patch [2] proposed to do).
0. I think that we could do better, some counter arguments to
no-configuration-at-all:
a. as Robert & Bertrand already put it there after review: let's say I
want just to run on NUMA #2 node, so here I would need to override
systemd's script ExecStart= to include that numactl (not elegant?). I
could also use `CPUAffinity=1,3,5,7..` but that's all, and it is even
less friendly. Also it probably requires root to edit/reload systemd,
while having GUC for this like in my proposal makes it more smooth (I
think?)

b. wouldn't it be better if that stayed as a drop-in rather than always
on? What if there's a problem, how do you disable those internal
optimizations if they do harm in some cases? (or let's say I want to
play with MPOL_INTERLEAVE_WEIGHTED?). So at least boolean
numa_buffers_interleave would be nice?

c. What if I want my standby (walreceiver+startup/recovery) to run
with NUMA affinity to get better performance (I'm not going to hack
around systemd script every time, but I could imagine changing
numa=X,Y,Z after restart/before promotion)

d. Now if I would be forced for some reason to do that numactl(1)
voodoo, and use those above-mentioned overrides and PG wouldn't be
having a GUC (let's say I would use `numactl
--weighted-interleave=0,1`), then:
I'm not against doing something like this, but I don't plan to do that
in V1. I don't have a clear idea what configurability is actually
needed, so it's likely I'd do the interface wrong.
2) v1-0002-NUMA-localalloc.patch
This simply sets "localalloc" when initializing a backend, so that all
memory allocated later is local, not interleaved. Initially this was
necessary because the patch set the allocation policy to interleaving
before initializing shared memory, and we didn't want to interleave the
private memory. But that's no longer the case - the explicit mapping to
nodes does not have this issue. I'm keeping the patch for convenience,
it allows experimenting with numactl etc.

is not accurate anymore and we would require to have that in
(still with GUC)?
Thoughts? I can add my part into your patches if you want.
I'm sorry, I don't understand what's the question :-(
Way too quick review and some very fast benchmark probes, I've
concentrated only on v1-0001 and v1-0005 (efficiency of buffermgmt
would be too new a topic for me), but let's start:

1. normal pgbench -S (still with just s_b@4GB), done many tries,
consistent benefit for the patch with like +8..10% boost on a generic
run:

numa_buffers_interleave=off numa_pgproc_interleave=on (due to that
always-on "if"), s_b just on 1 NUMA node (might happen)
latency average = 0.373 ms
latency stddev = 0.237 ms
initial connection time = 45.899 ms
tps = 160242.147877 (without initial connection time)

numa_buffers_interleave=on numa_pgproc_interleave=on
latency average = 0.345 ms
latency stddev = 0.373 ms
initial connection time = 44.485 ms
tps = 177564.686094 (without initial connection time)

2. Tested it the same way as I did for mine (problem #2 from Andres's
presentation): 4s32c128t, s_b=4GB (on 128GB), prewarm test (with
seqconcurrscans.pgb as earlier)
default/numa_buffers_interleave=off
latency average = 1375.478 ms
latency stddev = 1141.423 ms
initial connection time = 46.104 ms
tps = 45.868075 (without initial connection time)

numa_buffers_interleave=on
latency average = 838.128 ms
latency stddev = 498.787 ms
initial connection time = 43.437 ms
tps = 75.413894 (without initial connection time)

and I've repeated the same test (identical conditions) with my
patch, got me slightly more juice:
latency average = 727.717 ms
latency stddev = 410.767 ms
initial connection time = 45.119 ms
tps = 86.844161 (without initial connection time)

(but mine didn't get that boost from normal pgbench as per #1
pgbench -S -- my numa='all' stays @ 160k TPS just as
numa_buffers_interleave=off), so this idea is clearly better.
Good, thanks for the testing. I should have done something like this
when I posted my patches, but I forgot about that (and the email felt
too long anyway).
But this actually brings an interesting question. What exactly should we
expect / demand from these patches? In my mind it's primarily about
predictability and stability of results.
For example, the results should not depend on how the database was
warmed up - was it done by a single backend or many backends? Was it
restarted, or what? I could probably warmup the system very carefully to
ensure it's balanced. The patches mean I don't need to be that careful.
So should I close https://commitfest.postgresql.org/patch/5703/
and you'll open a new one or should I just edit the #5703 and alter it
and add this thread too?
Good question. It's probably best to close the original entry as
"withdrawn" and I'll add a new entry. Sounds OK?
3. Patch is not calling interleave on PQ shmem, do we want to add that
in as some next item like v1-0007? Question is whether OS interleaving
makes sense there? I believe it does there, please see my thread
(NUMA_pq_cpu_pinning_results.txt), the issue is that PQ workers are
being spawned by postmaster and may end up on different NUMA nodes
randomly, so actually OS-interleaving that memory reduces jitter there
(AKA bandwidth-over-latency). My thinking is that one cannot expect
static/forced CPU-to-just-one-NUMA-node assignment for a backend and
its PQ workers, because it is impossible to always have available CPU
power there in that NUMA node, so it might be useful to interleave
that shared mem there too (as separate patch item?)
Excellent question. I haven't thought about this at all. I agree it
probably makes sense to interleave this memory, in some way. I don't
know what's the perfect scheme, though.
wild idea: Would it make sense to pin the workers to the same NUMA node
as the leader? And allocate all memory only from that node?
4. In BufferManagerShmemInit() you call numa_num_configured_nodes()
(also in v1-0005). My worry is whether we should put some
known-limitations docs (?) in from the start and mention that
if the VM is greatly resized and new NUMA nodes appear, they might
not be used until restart?
Yes, this is one thing I need some feedback on. The patches mostly
assume there are no disabled nodes, that the set of allowed nodes does
not change, etc. I think for V1 that's a reasonable limitation.
But let's say we want to relax this a bit. How do we learn about the
change, after a node/CPU gets disabled? For some parts it's not that
difficult (e.g. we can "remap" buffers/descriptors) in the background.
But for other parts that's not practical. E.g. we can't rework how the
PGPROC gets split.
But while discussing this with Andres yesterday, he had an interesting
suggestion - to always use e.g. 8 or 16 partitions, then partition this
by NUMA node. So we'd have 16 partitions, and with 4 nodes the 0-3 would
go to node 0, 4-7 would go to node 1, etc. The advantage is that if a
node gets disabled, we can rebuild just this small "mapping" and not the
16 partitions. And the partitioning may be helpful even without NUMA.
Still have to figure out the details, but seems it might help.
5. In v1-0001, pg_numa_interleave_memory()
+ * XXX no return value, to make this fail on error, has to use
+ * numa_set_strict

Yes, my patch has those numa_error() and numa_warn() handlers too in
pg_numa. Feel free to use it for better UX.

+ * XXX Should we still touch the memory first, like with numa_move_pages,
+ * or is that not necessary?

It's not necessary to touch after numa_tonode_memory() (wrapper around
numa_interleave_memory()), if it is going to be used anyway it will be
correctly placed, to the best of my knowledge.

6. diff --git a/src/bin/pgbench/pgbench.c b/src/bin/pgbench/pgbench.c
Accidental indents (also fails to apply)
7. We miss the pg_numa_* shims, but for sure that's for later and also
avoid those Linux-specific #ifdef USE_LIBNUMA and so on?
Right, we need to add those. Or actually, we need to think about how
we'd do this for non-NUMA systems. I wonder if we even want to just
build everything the "old way" (without the partitions, etc.).
But per the earlier comment, the partitioning seems beneficial even on
non-NUMA systems, so maybe the shims are good enough. OK.
8. v1-0005, 2x: + /* if (numa_procs_interleave) */

Ha! It's a TRAP! I've uncommented it because I wanted to try it out
without it (just by setting the GUC off), but "MyProc->sema" is NULL:

2025-07-04 12:31:08.103 CEST [28754] LOG: starting PostgreSQL
19devel on x86_64-linux, compiled by gcc-12.2.0, 64-bit
[..]
2025-07-04 12:31:08.109 CEST [28754] LOG: io worker (PID 28755)
was terminated by signal 11: Segmentation fault
2025-07-04 12:31:08.109 CEST [28754] LOG: terminating any other
active server processes
2025-07-04 12:31:08.114 CEST [28754] LOG: shutting down because
"restart_after_crash" is off
2025-07-04 12:31:08.116 CEST [28754] LOG: database system is shut down

[New LWP 28755]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
Core was generated by `postgres: io worker '.
Program terminated with signal SIGSEGV, Segmentation fault.
#0 __new_sem_wait_fast (definitive_result=1, sem=sem@entry=0x0)
at ./nptl/sem_waitcommon.c:136
136 ./nptl/sem_waitcommon.c: No such file or directory.
(gdb) where
#0 __new_sem_wait_fast (definitive_result=1, sem=sem@entry=0x0)
at ./nptl/sem_waitcommon.c:136
#1 __new_sem_trywait (sem=sem@entry=0x0) at ./nptl/sem_wait.c:81
#2 0x00005561918e0cac in PGSemaphoreReset (sema=0x0) at
../src/backend/port/posix_sema.c:302
#3 0x0000556191970553 in InitAuxiliaryProcess () at
../src/backend/storage/lmgr/proc.c:992
#4 0x00005561918e51a2 in AuxiliaryProcessMainCommon () at
../src/backend/postmaster/auxprocess.c:65
#5 0x0000556191940676 in IoWorkerMain (startup_data=<optimized
out>, startup_data_len=<optimized out>) at
../src/backend/storage/aio/method_worker.c:393
#6 0x00005561918e8163 in postmaster_child_launch
(child_type=child_type@entry=B_IO_WORKER, child_slot=20086,
startup_data=startup_data@entry=0x0,
startup_data_len=startup_data_len@entry=0,
client_sock=client_sock@entry=0x0) at
../src/backend/postmaster/launch_backend.c:290
#7 0x00005561918ea09a in StartChildProcess
(type=type@entry=B_IO_WORKER) at
../src/backend/postmaster/postmaster.c:3973
#8 0x00005561918ea308 in maybe_adjust_io_workers () at
../src/backend/postmaster/postmaster.c:4404
[..]
(gdb) print *MyProc->sem
Cannot access memory at address 0x0
Yeah, good catch. I'll look into that next week.
9. v1-0006: is this just a thought or a serious candidate? I can imagine
it can easily blow up with some backends somehow requesting CPUs only
from one NUMA node, while the second node stays idle. Isn't it better
just to leave CPU scheduling, well, to the CPU scheduler? The problem
is that you have tools showing overall CPU usage, even mpstat(1) per
CPU, but no tools for per-NUMA-node CPU util%, so it would be hard
for someone to realize that this is happening.
Mostly experimental, for benchmarking etc. I agree we may not want to
mess with the task scheduling too much.
Thanks for the feedback!
regards
--
Tomas Vondra
Hi Tomas,
I haven't yet had time to fully read all the work and proposals around
NUMA and related features, but I hope to catch up over the summer.
However, I think it's important to share some thoughts before it's too
late, as you might find them relevant to the NUMA management code.
6) v1-0006-NUMA-pin-backends-to-NUMA-nodes.patch
This is an experimental patch, that simply pins the new process to the
NUMA node obtained from the freelist.

Driven by GUC "numa_procs_pin" (default: off).
In my work on more careful PostgreSQL resource management, I've come to
the conclusion that we should avoid pushing policy too deeply into the
PostgreSQL core itself. Therefore, I'm quite skeptical about integrating
NUMA-specific management directly into core PostgreSQL in such a way.
We are working on a PROFILE and PROFILE MANAGER specification to provide
PostgreSQL with only the APIs and hooks needed so that extensions can
manage whatever they want externally.
The basic syntax (not meant to be discussed here, and even the names
might change) is roughly as follows, just to illustrate the intent:
CREATE PROFILE MANAGER manager_name [IF NOT EXISTS]
[ HANDLER handler_function | NO HANDLER ]
[ VALIDATOR validator_function | NO VALIDATOR ]
[ OPTIONS ( option 'value' [, ... ] ) ]
CREATE PROFILE profile_name
[IF NOT EXISTS]
USING profile_manager
SET key = value [, key = value]...
[USING profile_manager
SET key = value [, key = value]...]
[...];
CREATE PROFILE MAPPING
[IF NOT EXISTS]
FOR PROFILE profile_name
[MATCH [ ALL | ANY ] (
[ROLE role_name],
[BACKEND TYPE backend_type],
[DATABASE database_name],
[APPLICATION appname]
)];
## PROFILE RESOLUTION ORDER
1. ALTER ROLE IN DATABASE
2. ALTER ROLE
3. ALTER DATABASE
4. First matching PROFILE MAPPING (global or specific)
5. No profile (fallback)
As currently designed, this approach allows quite a lot of flexibility:
* pg_psi is used to ensure the spec is suitable for a cgroup profile
manager (moving PIDs as needed; NUMA and cgroups could work well
together, see e.g. this Linux kernel summary:
https://blogs.oracle.com/linux/post/numa-balancing )
* Someone else could implement support for Windows or BSD specifics.
* Others might use it to integrate PostgreSQL's own resources (e.g.,
"areas" of shared buffers) into policies.
Hope this perspective is helpful.
Best regards,
--
Cédric Villemain +33 6 20 30 22 52
https://www.Data-Bene.io
PostgreSQL Support, Expertise, Training, R&D
On 7/5/25 09:09, Cédric Villemain wrote:
Hi Tomas,
I haven't yet had time to fully read all the work and proposals around
NUMA and related features, but I hope to catch up over the summer.

However, I think it's important to share some thoughts before it's too
late, as you might find them relevant to the NUMA management code.

6) v1-0006-NUMA-pin-backends-to-NUMA-nodes.patch

This is an experimental patch, that simply pins the new process to the
NUMA node obtained from the freelist.

Driven by GUC "numa_procs_pin" (default: off).

In my work on more careful PostgreSQL resource management, I've come to
the conclusion that we should avoid pushing policy too deeply into the
PostgreSQL core itself. Therefore, I'm quite skeptical about integrating
NUMA-specific management directly into core PostgreSQL in such a way.

We are working on a PROFILE and PROFILE MANAGER specification to provide
PostgreSQL with only the APIs and hooks needed so that extensions can
manage whatever they want externally.

The basic syntax (not meant to be discussed here, and even the names
might change) is roughly as follows, just to illustrate the intent:

CREATE PROFILE MANAGER manager_name [IF NOT EXISTS]
[ HANDLER handler_function | NO HANDLER ]
[ VALIDATOR validator_function | NO VALIDATOR ]
[ OPTIONS ( option 'value' [, ... ] ) ]

CREATE PROFILE profile_name
[IF NOT EXISTS]
USING profile_manager
SET key = value [, key = value]...
[USING profile_manager
SET key = value [, key = value]...]
[...];

CREATE PROFILE MAPPING
[IF NOT EXISTS]
FOR PROFILE profile_name
[MATCH [ ALL | ANY ] (
[ROLE role_name],
[BACKEND TYPE backend_type],
[DATABASE database_name],
[APPLICATION appname]
)];

## PROFILE RESOLUTION ORDER
1. ALTER ROLE IN DATABASE
2. ALTER ROLE
3. ALTER DATABASE
4. First matching PROFILE MAPPING (global or specific)
5. No profile (fallback)

As currently designed, this approach allows quite a lot of flexibility:
* pg_psi is used to ensure the spec is suitable for a cgroup profile
manager (moving PIDs as needed; NUMA and cgroups could work well
together, see e.g. this Linux kernel summary:
https://blogs.oracle.com/linux/post/numa-balancing )
* Someone else could implement support for Windows or BSD specifics.
* Others might use it to integrate PostgreSQL's own resources (e.g.,
"areas" of shared buffers) into policies.

Hope this perspective is helpful.
Can you explain how you want to manage this by an extension defined at
the SQL level, when most of this stuff has to be done when setting up
shared memory, which is waaaay before we have any access to catalogs?
regards
--
Tomas Vondra
Hi Tomas, some more thoughts after the weekend:
On Fri, Jul 4, 2025 at 8:12 PM Tomas Vondra <tomas@vondra.me> wrote:
On 7/4/25 13:05, Jakub Wartak wrote:
On Tue, Jul 1, 2025 at 9:07 PM Tomas Vondra <tomas@vondra.me> wrote:
Hi!
1) v1-0001-NUMA-interleaving-buffers.patch
[..]
It's a bit more complicated, because the patch distributes both the
blocks and descriptors, in the same way. So a buffer and its descriptor
always end on the same NUMA node. This is one of the reasons why we need
to map larger chunks, because NUMA works on page granularity, and the
descriptors are tiny - many fit on a memory page.

Oh, now I get it! OK, let's stick to this one.
I don't think the splitting would actually make some things simpler, or
maybe more flexible - in particular, it'd allow us to enable huge pages
only for some regions (like shared buffers), and keep the small pages
e.g. for PGPROC. So that'd be good.

You have made the assumption that this is good, but small pages (4KB) are
not hugetlb, and are *swappable* (Transparent HP are swappable too,
manually allocated ones as with mmap(MMAP_HUGETLB) are not)[1]. The
most frequent problem I see these days is OOMs, and it makes me
believe that making certain critical parts of shared memory
swappable just to make the page size granular is possibly throwing the baby
out with the bathwater. I'm thinking about bad situations like: some
wrong settings of vm.swappiness that people keep (or distros keep?) and
general inability of PG to restrain from allocating more memory in
some cases.

I haven't observed such issues myself, or maybe I didn't realize it's
happening. Maybe it happens, but it'd be good to see some data showing
that, or a reproducer of some sort. But let's say it's real.

I don't think we should use huge pages merely to ensure something is not
swapped out. The "not swappable" is more of a limitation of huge pages,
not an advantage. You can't just choose to make them swappable.

Wouldn't it be better to keep using 4KB pages, but lock the memory using
mlock/mlockall?
In my book, not being swappable is a win (it's hard for me to imagine
when it could be beneficial to swap out parts of s_b).
I was trying to think about it and also got those:
Anyway mlock() probably sounds like it, but e.g. Rocky 8.10 by default
has max locked memory (ulimit -l) as low as 64kB due to systemd's
DefaultLimitMEMLOCK, but Debian/Ubuntu have those at higher values.
Wasn't expecting that - those are bizarrely low values. I think we would
need something like (10000*900)/1024/1024 or more, but with each
PGPROC on a separate page that would be even way more?
Another thing with 4kB pages: there's this big assumption now made
that once we arrive in InitProcess() we won't ever change NUMA node,
so we stick to the PGPROC from where we started (based on getcpu(2)).
Let's assume the CPU scheduler reassigned us to a different node, but we have
now this 4kB patch ready for PGPROC in theory and this means we would
need to rely on the NUMA autobalancing doing its job to migrate that
4kB page from node to node (to get better local accesses instead of
remote ones). The questions in my head are now like that:
- but we have initially asked for those PGPROC pages to be localized
on certain node (they have policy), so they won't autobalance? We
would need to somewhere call getcpu() again, notice the difference and
unlocalize (clear the NUMA/mbind() policy) for the PGPROC page?
- mlocked() as above says stick to physical RAM page (?) , so it won't move?
- after what time would the kernel's autobalancing migrate that page
since switching the active CPU<->node? I mean do we execute enough reads on
this page?
BTW: to move this into the pragmatic realm, what's the most
trivial one-liner way to exercise/stress PGPROC?
The other thing I haven't thought about very much is determining on
which CPUs/nodes the instance is allowed to run. I assume we'd start by
simply inheriting/determining that at the start through libnuma, not through
some custom PG configuration (which the patch [2] proposed to do).

0. I think that we could do better, some counter arguments to
no-configuration-at-all:

a. as Robert & Bertrand already put it there after review: let's say I
want just to run on NUMA #2 node, so here I would need to override
systemd's script ExecStart= to include that numactl (not elegant?). I
could also use `CPUAffinity=1,3,5,7..` but that's all, and it is even
less friendly. Also it probably requires root to edit/reload systemd,
while having GUC for this like in my proposal makes it more smooth (I
think?)

b. wouldn't it be better if that stayed as a drop-in rather than always
on? What if there's a problem, how do you disable those internal
optimizations if they do harm in some cases? (or let's say I want to
play with MPOL_INTERLEAVE_WEIGHTED?). So at least boolean
numa_buffers_interleave would be nice?

c. What if I want my standby (walreceiver+startup/recovery) to run
with NUMA affinity to get better performance (I'm not going to hack
around systemd script every time, but I could imagine changing
numa=X,Y,Z after restart/before promotion)

d. Now if I would be forced for some reason to do that numactl(1)
voodoo, and use those above-mentioned overrides and PG wouldn't be
having a GUC (let's say I would use `numactl
--weighted-interleave=0,1`), then:

I'm not against doing something like this, but I don't plan to do that
in V1. I don't have a clear idea what configurability is actually
needed, so it's likely I'd do the interface wrong.

2) v1-0002-NUMA-localalloc.patch
This simply sets "localalloc" when initializing a backend, so that all
memory allocated later is local, not interleaved. Initially this was
necessary because the patch set the allocation policy to interleaving
before initializing shared memory, and we didn't want to interleave the
private memory. But that's no longer the case - the explicit mapping to
nodes does not have this issue. I'm keeping the patch for convenience,
it allows experimenting with numactl etc.

is not accurate anymore and we would require to have that in
(still with GUC)?

Thoughts? I can add my part into your patches if you want.

I'm sorry, I don't understand what's the question :-(
That patch reference above, it was a chain of thought from step "d".
What I had in mind was that you cannot remove the patch
`v1-0002-NUMA-localalloc.patch` from the scope if forcing people to
use numactl by not having enough configurability on the PG side. That
is: if someone will have to use systemd+numactl
--interleave/--weighted-interleave then, he will also need to have a
way to use numa_localalloc=on (to override the new/user's policy
default, otherwise local mem allocations are also going to be
interleaved, and we are back to square one). Which brings me to the
point of why, instead of this toggle, we should include the configuration
properly from the start (it's not that hard, apparently).
Way too quick review and some very fast benchmark probes, I've
concentrated only on v1-0001 and v1-0005 (efficiency of buffermgmt
would be too new a topic for me), but let's start:

1. normal pgbench -S (still with just s_b@4GB), done many tries,
consistent benefit for the patch with like +8..10% boost on generic
run:
[.. removed numbers]
But this actually brings an interesting question. What exactly should we
expect / demand from these patches? In my mind it'd primarily about
predictability and stability of results.

For example, the results should not depend on how the database was
warmed up - was it done by a single backend or many backends? Was it
restarted, or what? I could probably warmup the system very carefully to
ensure it's balanced. The patches mean I don't need to be that careful.
Well, pretty much the same here. I was after minimizing "stddev" (to
have better predictability of results, especially across restarts) and
increasing available bandwidth [which is pretty much related]. Without
our NUMA work, PG can just put that s_b on any random node or spill
randomly from one node to another (depending on the size of the allocation request).
So should I close https://commitfest.postgresql.org/patch/5703/
and you'll open a new one or should I just edit the #5703 and alter it
and add this thread too?

Good question. It's probably best to close the original entry as
"withdrawn" and I'll add a new entry. Sounds OK?
Sure thing, marked it as `Returned with feedback`, this approach seems
to be much more advanced.
3. Patch is not calling interleave on PQ shmem, do we want to add that
in as some next item like v1-0007? Question is whether OS interleaving
makes sense there? I believe it does there, please see my thread
(NUMA_pq_cpu_pinning_results.txt), the issue is that PQ workers are
being spawned by postmaster and may end up on different NUMA nodes
randomly, so actually OS-interleaving that memory reduces jitter there
(AKA bandwidth-over-latency). My thinking is that one cannot expect
static/forced CPU-to-just-one-NUMA-node assignment for a backend and
its PQ workers, because it is impossible to always have available CPU
power there in that NUMA node, so it might be useful to interleave
that shared mem there too (as separate patch item?)

Excellent question. I haven't thought about this at all. I agree it
probably makes sense to interleave this memory, in some way. I don't
know what's the perfect scheme, though.

wild idea: Would it make sense to pin the workers to the same NUMA node
as the leader? And allocate all memory only from that node?
I'm trying to convey exactly the opposite message or at least that it
might depend on configuration. Please see
/messages/by-id/CAKZiRmxYMPbQ4WiyJWh=Vuw_Ny+hLGH9_9FaacKRJvzZ-smm+w@mail.gmail.com
(btw it should read there that I don't intend to spend a lot of time on
PQ), but anyway: I think we should NOT pin the PQ workers to the same
NODE as you do not know if there's CPU left there (same story as with
v1-0006 here).
I'm just proposing quick OS-based interleaving of PQ shm if using all
nodes, literally:
@@ -334,6 +336,13 @@ dsm_impl_posix(dsm_op op, dsm_handle handle, Size
request_size,
}
*mapped_address = address;
*mapped_size = request_size;
+
+ /* We interleave memory only at creation time. */
+ if (op == DSM_OP_CREATE && numa->setting > NUMA_OFF) {
+ elog(DEBUG1, "interleaving shm mem @ %p size=%zu",
*mapped_address, *mapped_size);
+ pg_numa_interleave_memptr(*mapped_address, *mapped_size, numa->nodes);
+ }
+
Because then if memory is interleaved you have probably less variance
for memory access. But also from that previous thread:
"So if anything:
- latency-wise: it would be best to place leader+all PQ workers close
to s_b, provided s_b fits NUMA shared/huge page memory there and you
won't need more CPU than there's on that NUMA node... (assuming e.g.
hosting 4 DBs on 4-sockets each on it's own, it would be best to pin
everything including shm, but PQ workers too)
- capacity/TPS-wise or s_b > NUMA: just interleave to maximize
bandwidth and get uniform CPU performance out of this"
So the wild idea was: maybe PQ shm interleaving should depend on NUMA
configuration (if interleaving to all nodes, then interleave normally,
but if the configuration sets just 1 NUMA node, it automatically binds
there -- there was '@' support for that in my patch).
4. In BufferManagerShmemInit() you call numa_num_configured_nodes()
(also in v1-0005). My worry is whether we should put some
known-limitations docs (?) in from the start and mention that
if the VM is greatly resized and new NUMA nodes appear, they might
not be used until restart?

Yes, this is one thing I need some feedback on. The patches mostly
assume there are no disabled nodes, that the set of allowed nodes does
not change, etc. I think for V1 that's a reasonable limitation.
Sure!
But let's say we want to relax this a bit. How do we learn about the
change, after a node/CPU gets disabled? For some parts it's not that
difficult (e.g. we can "remap" buffers/descriptors) in the background.
But for other parts that's not practical. E.g. we can't rework how the
PGPROC gets split.

But while discussing this with Andres yesterday, he had an interesting
suggestion - to always use e.g. 8 or 16 partitions, then partition this
by NUMA node. So we'd have 16 partitions, and with 4 nodes the 0-3 would
go to node 0, 4-7 would go to node 1, etc. The advantage is that if a
node gets disabled, we can rebuild just this small "mapping" and not the
16 partitions. And the partitioning may be helpful even without NUMA.

Still have to figure out the details, but seems it might help.
Right, no idea how the shared_memory remapping patch will work
(how/when the s_b change will be executed), but we could somehow mark
that number of NUMA zones could be rechecked during SIGHUP (?) and
then just simple compare check if old_numa_num_configured_nodes ==
new_numa_num_configured_nodes is true.
Anyway, I think it's way too advanced for now, don't you think? (like
CPU ballooning [s_b itself] is rare, and NUMA ballooning seems to be
super-wild-rare).
As for the rest, forgot to include this too: getcpu() - this really
needs a portable pg_getcpu() wrapper.
-J.
* Others might use it to integrate PostgreSQL's own resources (e.g.,
"areas" of shared buffers) into policies.

Hope this perspective is helpful.
Can you explain how you want to manage this by an extension defined at
the SQL level, when most of this stuff has to be done when setting up
shared memory, which is waaaay before we have any access to catalogs?
I should have said module instead, I didn't follow carefully but at some
point there were discussion about shared buffers resized "on-line".
Anyway, it was just to give some few examples, maybe this one is to be
considered later (I'm focused on cgroup/psi, and precisely reassigning
PIDs as needed).
--
Cédric Villemain +33 6 20 30 22 52
https://www.Data-Bene.io
PostgreSQL Support, Expertise, Training, R&D
On 7/7/25 16:51, Cédric Villemain wrote:
* Others might use it to integrate PostgreSQL's own resources (e.g.,
"areas" of shared buffers) into policies.

Hope this perspective is helpful.
Can you explain how you want to manage this by an extension defined at
the SQL level, when most of this stuff has to be done when setting up
shared memory, which is waaaay before we have any access to catalogs?

I should have said module instead, I didn't follow carefully but at some
point there were discussion about shared buffers resized "on-line".
Anyway, it was just to give some few examples, maybe this one is to be
considered later (I'm focused on cgroup/psi, and precisely reassigning
PIDs as needed).
I don't know. I have a hard time imagining what exactly the
policies / profiles would do to respond to changes in the system
utilization. And why should that interfere with this patch ...
The main thing this patch series aims to implement is partitioning different
pieces of shared memory (buffers, freelists, ...) to better work for
NUMA. I don't think there's that many ways to do this, and I doubt it
makes sense to make this easily customizable from external modules of
any kind. I can imagine providing some API allowing to isolate the
instance on selected NUMA nodes, but that's about it.
Yes, there's some relation to the online resizing of shared buffers, in
which case we need to "refresh" some of the information. But AFAICS it's
not very extensive (on top of what already needs to happen after the
resize), and it'd happen within the boundaries of the partitioning
scheme. There's not that much flexibility.
The last bit (pinning backends to a NUMA node) is experimental, and
mostly intended for easier evaluation of the earlier parts (e.g. to
limit the noise when processes get moved to a CPU from a different NUMA
node, and so on).
regards
--
Tomas Vondra
Hi,
On 2025-07-05 07:09:00 +0000, Cédric Villemain wrote:
In my work on more careful PostgreSQL resource management, I've come to the
conclusion that we should avoid pushing policy too deeply into the
PostgreSQL core itself. Therefore, I'm quite skeptical about integrating
NUMA-specific management directly into core PostgreSQL in such a way.
I think it's actually the opposite - whenever we pushed stuff like this
outside of core it has hurt postgres substantially. Not having replication in
core was a huge mistake. Not having HA management in core is probably the
biggest current adoption hurdle for postgres.
To deal better with NUMA we need to improve memory placement and various
algorithms, in an interrelated way - that's pretty much impossible to do
outside of core.
Greetings,
Andres Freund
On 7/7/25 16:51, Cédric Villemain wrote:
* Others might use it to integrate PostgreSQL's own resources (e.g.,
"areas" of shared buffers) into policies.Hope this perspective is helpful.
Can you explain how you want to manage this by an extension defined at
the SQL level, when most of this stuff has to be done when setting up
shared memory, which is waaaay before we have any access to catalogs?

I should have said module instead, I didn't follow carefully but at some
point there were discussion about shared buffers resized "on-line".
Anyway, it was just to give some few examples, maybe this one is to be
considered later (I'm focused on cgroup/psi, and precisely reassigning
PIDs as needed).

I don't know. I have a hard time imagining what exactly the
policies / profiles would do to respond to changes in the system
utilization. And why should that interfere with this patch ...

The main thing this patch series aims to implement is partitioning different
pieces of shared memory (buffers, freelists, ...) to better work for
NUMA. I don't think there's that many ways to do this, and I doubt it
makes sense to make this easily customizable from external modules of
any kind. I can imagine providing some API allowing to isolate the
instance on selected NUMA nodes, but that's about it.

Yes, there's some relation to the online resizing of shared buffers, in
which case we need to "refresh" some of the information. But AFAICS it's
not very extensive (on top of what already needs to happen after the
resize), and it'd happen within the boundaries of the partitioning
scheme. There's not that much flexibility.

The last bit (pinning backends to a NUMA node) is experimental, and
mostly intended for easier evaluation of the earlier parts (e.g. to
limit the noise when processes get moved to a CPU from a different NUMA
node, and so on).
Maybe the backend pinning can be done by replacing your patch in proc.c
with a call to an external profile manager doing exactly the same thing?
Similar to:

    pmroutine = GetPmRoutineForInitProcess();
    if (pmroutine != NULL &&
        pmroutine->init_process != NULL)
        pmroutine->init_process(MyProc);

    ...

    pmroutine = GetPmRoutineForInitAuxilliary();
    if (pmroutine != NULL &&
        pmroutine->init_auxilliary != NULL)
        pmroutine->init_auxilliary(MyProc);
Added in a few rare places, this should cover most if not all of the
requirements around process placement (process_shared_preload_libraries()
is called earlier in process creation, I believe).
--
Cédric Villemain +33 6 20 30 22 52
https://www.Data-Bene.io
PostgreSQL Support, Expertise, Training, R&D
Hi Andres,
Hi,
On 2025-07-05 07:09:00 +0000, Cédric Villemain wrote:
In my work on more careful PostgreSQL resource management, I've come to the
conclusion that we should avoid pushing policy too deeply into the
PostgreSQL core itself. Therefore, I'm quite skeptical about integrating
NUMA-specific management directly into core PostgreSQL in such a way.

I think it's actually the opposite - whenever we pushed stuff like this
outside of core it has hurt postgres substantially. Not having replication in
core was a huge mistake. Not having HA management in core is probably the
biggest current adoption hurdle for postgres.

To deal better with NUMA we need to improve memory placement and various
algorithms, in an interrelated way - that's pretty much impossible to do
outside of core.
Except the backend pinning which is easy to achieve, thus my comment on
the related patch.
I'm not claiming NUMA memory and all should be managed outside of core
(though I didn't read other patches yet).
--
Cédric Villemain +33 6 20 30 22 52
https://www.Data-Bene.io
PostgreSQL Support, Expertise, Training, R&D
On 7/7/25 16:51, Cédric Villemain wrote:
[...]

Maybe the backend pinning can be done by replacing your patch in proc.c
with a call to an external profile manager doing exactly the same thing?

Similar to:
    pmroutine = GetPmRoutineForInitProcess();
    if (pmroutine != NULL &&
        pmroutine->init_process != NULL)
        pmroutine->init_process(MyProc);

    ...

    pmroutine = GetPmRoutineForInitAuxilliary();
    if (pmroutine != NULL &&
        pmroutine->init_auxilliary != NULL)
        pmroutine->init_auxilliary(MyProc);

Added in a few rare places, this should cover most if not all of the
requirements around process placement (process_shared_preload_libraries()
is called earlier in process creation, I believe).
After a first read I think this works for patches 002 and 005. For this
last one, InitProcGlobal() may set up things as you do, but then expose
the choice a bit later, basically in the places where you added the if
condition on the GUC (numa_procs_interleave).
--
Cédric Villemain +33 6 20 30 22 52
https://www.Data-Bene.io
PostgreSQL Support, Expertise, Training, R&D
Hi,
On 2025-07-04 13:05:05 +0200, Jakub Wartak wrote:
On Tue, Jul 1, 2025 at 9:07 PM Tomas Vondra <tomas@vondra.me> wrote:
I don't think the splitting would actually make some things simpler, or
maybe more flexible - in particular, it'd allow us to enable huge pages
only for some regions (like shared buffers), and keep the small pages
e.g. for PGPROC. So that'd be good.

You have made the assumption that this is good, but small pages (4KB) are
not hugetlb, and are *swappable* (transparent huge pages are swappable
too; manually allocated ones, as with mmap(MAP_HUGETLB), are not) [1]. The
most frequent problem I see these days are OOMs, and it makes me
believe that making certain critical parts of shared memory being
swappable just to make pagesize granular is possibly throwing the baby
out with the bathwater. I'm thinking about bad situations like: some
wrong settings of vm.swappiness that people keep (or distros keep?) and
general inability of PG to restrain from allocating more memory in
some cases.
The reason it would be advantageous to put something like the procarray onto
smaller pages is that otherwise the entire procarray (unless particularly
large) ends up on a single NUMA node, increasing the latency for backends on
every other numa node and increasing memory traffic on that node.
Greetings,
Andres Freund
On 7/8/25 05:04, Andres Freund wrote:
Hi,
On 2025-07-04 13:05:05 +0200, Jakub Wartak wrote:
On Tue, Jul 1, 2025 at 9:07 PM Tomas Vondra <tomas@vondra.me> wrote:
I don't think the splitting would actually make some things simpler, or
maybe more flexible - in particular, it'd allow us to enable huge pages
only for some regions (like shared buffers), and keep the small pages
e.g. for PGPROC. So that'd be good.

You have made the assumption that this is good, but small pages (4KB) are
not hugetlb, and are *swappable* (transparent huge pages are swappable
too; manually allocated ones, as with mmap(MAP_HUGETLB), are not) [1]. The
most frequent problem I see these days are OOMs, and it makes me
believe that making certain critical parts of shared memory being
swappable just to make pagesize granular is possibly throwing the baby
out with the bathwater. I'm thinking about bad situations like: some
wrong settings of vm.swappiness that people keep (or distros keep?) and
general inability of PG to restrain from allocating more memory in
some cases.

The reason it would be advantageous to put something like the procarray onto
smaller pages is that otherwise the entire procarray (unless particularly
large) ends up on a single NUMA node, increasing the latency for backends on
every other numa node and increasing memory traffic on that node.
That's why the patch series splits the procarray into multiple pieces,
so that it can be properly distributed on multiple NUMA nodes even with
huge pages. It requires adjusting a couple places accessing the entries,
but it surprised me how limited the impact was.
If we could selectively use 4KB pages for parts of the shared memory,
maybe this wouldn't be necessary. But it's not too annoying.
The thing I'm not sure about is how much this actually helps with the
traffic between nodes. Sure, if we pick a PGPROC from the same node, and
the task does not get moved, it'll be local traffic. But if the task
moves, there'll be traffic. I don't have any estimates how often this
happens, e.g. for older tasks.
regards
--
Tomas Vondra
On 7/8/25 03:55, Cédric Villemain wrote:
Hi Andres,
Hi,
On 2025-07-05 07:09:00 +0000, Cédric Villemain wrote:
In my work on more careful PostgreSQL resource management, I've come
to the
conclusion that we should avoid pushing policy too deeply into the
PostgreSQL core itself. Therefore, I'm quite skeptical about integrating
NUMA-specific management directly into core PostgreSQL in such a way.

I think it's actually the opposite - whenever we pushed stuff like this
outside of core it has hurt postgres substantially. Not having
replication in
core was a huge mistake. Not having HA management in core is probably the
biggest current adoption hurdle for postgres.

To deal better with NUMA we need to improve memory placement and various
algorithms, in an interrelated way - that's pretty much impossible to do
outside of core.

Except the backend pinning which is easy to achieve, thus my comment on
the related patch.
I'm not claiming NUMA memory and all should be managed outside of core
(though I didn't read other patches yet).
But an "optimal backend placement" seems to very much depend on where we
placed the various pieces of shared memory. Which the external module
will have trouble following, I suspect.
I still don't have any idea what exactly the external module would do,
or how it would decide where to place the backend. Can you describe some
use case with an example?
Assuming we want to actually pin tasks from within Postgres, what I
think might work is allowing modules to "advise" on where to place the
task. But the decision would still be done by core.
regards
--
Tomas Vondra
Hi,
On 2025-07-08 14:27:12 +0200, Tomas Vondra wrote:
On 7/8/25 05:04, Andres Freund wrote:
On 2025-07-04 13:05:05 +0200, Jakub Wartak wrote:
The reason it would be advantageous to put something like the procarray onto
smaller pages is that otherwise the entire procarray (unless particularly
large) ends up on a single NUMA node, increasing the latency for backends on
every other numa node and increasing memory traffic on that node.

That's why the patch series splits the procarray into multiple pieces,
so that it can be properly distributed on multiple NUMA nodes even with
huge pages. It requires adjusting a couple places accessing the entries,
but it surprised me how limited the impact was.
Sure, you can do that, but it does mean that iterations over the procarray now
have an added level of indirection...
The thing I'm not sure about is how much this actually helps with the
traffic between nodes. Sure, if we pick a PGPROC from the same node, and
the task does not get moved, it'll be local traffic. But if the task
moves, there'll be traffic. I don't have any estimates how often this
happens, e.g. for older tasks.
I think the most important bit is to not put everything onto one numa node,
otherwise the chance of increased latency for *everyone* due to the increased
memory contention is more likely to hurt.
Greetings,
Andres Freund