Adding basic NUMA awareness
Hi,
This is a WIP version of a patch series I'm working on, adding some
basic NUMA awareness for a couple parts of our shared memory (shared
buffers, etc.). It's based on Andres' experimental patches he spoke
about at pgconf.eu 2024 [1], and while it's improved and polished in
various ways, it's still experimental.
But there's a recent thread aiming to do something similar [2], so
better to share it now so that we can discuss both approaches. This
patch set is a bit more ambitious, handling NUMA in a way to allow
smarter optimizations later, so I'm posting it in a separate thread.
The series is split into patches addressing different parts of the
shared memory, starting (unsurprisingly) from shared buffers, then
buffer freelists and ProcArray. There are a couple of additional parts,
but those are smaller, addressing miscellaneous stuff.
Each patch has a numa_ GUC, intended to enable/disable that part. This
is meant to make development easier, not as a final interface. I'm not
sure how exactly that should look. It's possible some combinations of
GUCs won't work, etc.
Each patch should have a commit message explaining the intent and
implementation, and then also detailed comments explaining various
challenges and open questions.
But let me go over the basics, and discuss some of the design choices
and open questions that need solving.
1) v1-0001-NUMA-interleaving-buffers.patch
This is the main thing when people think about NUMA - making sure the
shared buffers are allocated evenly on all the nodes, not just on a
single node (which can happen easily with warmup). The regular memory
interleaving would address this, but it also has some disadvantages.
Firstly, it's oblivious to the contents of the shared memory segment,
and we may not want to interleave everything. It's also oblivious to
alignment of the items (a buffer can easily end up "split" on multiple
NUMA nodes), or relationship between different parts (e.g. there's a
BufferBlock and a related BufferDescriptor, and those might again end up
on different nodes).
So the patch handles this by explicitly mapping chunks of shared buffers
to different nodes - a bit like interleaving, but in larger chunks.
Ideally each node gets (1/N) of shared buffers, as a contiguous chunk.
It's a bit more complicated, because the patch distributes both the
blocks and descriptors, in the same way. So a buffer and its descriptor
always end up on the same NUMA node. This is one of the reasons why we need
to map larger chunks, because NUMA works on page granularity, and the
descriptors are tiny - many fit on a memory page.
There's a secondary benefit of explicitly assigning buffers to nodes,
using this simple scheme - it allows quickly determining the node ID
given a buffer ID. This is helpful later, when building the freelists.
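For illustration, the mapping is just integer arithmetic over the chunk
size - this mirrors BufferGetNode() in the 0001 patch (the helper name
here is made up):

    /* chunked round-robin: which NUMA node does this buffer belong to? */
    static inline int
    buffer_id_to_node(int buf_id, int64 chunk_buffers, int num_nodes)
    {
        return (int) ((buf_id / chunk_buffers) % num_nodes);
    }

With 8kB buffers a 1GB chunk is 131072 buffers, so on a 4-node system
buffers 0-131071 map to node 0, 131072-262143 to node 1, and so on,
wrapping around after the last node.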
The patch is fairly simple. Most of the complexity is about picking the
chunk size, and aligning the arrays (so that it nicely aligns with
memory pages).
The patch has a GUC "numa_buffers_interleave", with "off" by default.
2) v1-0002-NUMA-localalloc.patch
This simply sets "localalloc" when initializing a backend, so that all
memory allocated later is local, not interleaved. Initially this was
necessary because the patch set the allocation policy to interleaving
before initializing shared memory, and we didn't want to interleave the
private memory. But that's no longer the case - the explicit mapping to
nodes does not have this issue. I'm keeping the patch for convenience,
it allows experimenting with numactl etc.
The patch has a GUC "numa_localalloc", with "off" by default.
3) v1-0003-freelist-Don-t-track-tail-of-a-freelist.patch
Minor optimization. Andres noticed we're tracking the tail of buffer
freelist, without using it. So the patch removes that.
4) v1-0004-NUMA-partition-buffer-freelist.patch
Right now we have a single freelist, and in busy instances that can be
quite contended. What's worse, the freelist may thrash between different
CPUs, NUMA nodes, etc. So the idea is to have multiple freelists on
subsets of buffers. The patch implements multiple strategies for how the
list can be split (configured using the "numa_partition_freelist" GUC), for
experimenting (see the sketch after the list):
* node - One list per NUMA node. This is the most natural option,
because we now know which buffer is on which node, so we can ensure a
list for a node only has buffers from that node.
* cpu - One list per CPU. Pretty simple, each CPU gets its own list.
* pid - Similar to "cpu", but the processes are mapped to lists based on
PID, not CPU ID.
* none - nothing, a single freelist
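The selection itself is trivial - simplified from ChooseFreeList() in the
0004 patch:

    unsigned    cpu, node;
    int         freelist_idx = 0;

    if (getcpu(&cpu, &node) != 0)
        elog(ERROR, "getcpu failed: %m");

    switch (numa_partition_freelist)
    {
        case FREELIST_PARTITION_CPU:
            freelist_idx = cpu % strategy_ncpus;
            break;
        case FREELIST_PARTITION_NODE:
            freelist_idx = node % strategy_nnodes;
            break;
        case FREELIST_PARTITION_PID:
            freelist_idx = MyProcPid % strategy_ncpus;
            break;
        default:                /* "none" - single freelist */
            break;
    }

    return &StrategyControl->freelists[freelist_idx];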
Ultimately, I think we'll want to go with "node", simply because it
aligns with the buffer interleaving. But there are improvements needed.
The main challenge is that with multiple smaller lists, a process can't
really use the whole shared buffers. So a single backend will only use
part of the memory. The more lists there are, the worse this effect is.
This is also why I think we won't use the other partitioning options,
because there's going to be more CPUs than NUMA nodes.
Obviously, this needs solving even with per-node freelists - we need to
allow a single backend to utilize the whole shared buffers if needed. There
should be a way to "steal" buffers from other freelists (if the
"regular" freelist is empty), but the patch does not implement this.
Shouldn't be hard, I think.
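A rough sketch of what the fallback might look like (hypothetical, not in
the attached patches - get_buffer_from_freelist() is a made-up helper
wrapping the spinlocked removal from one list):

    /* try the "home" freelist first */
    freelist = ChooseFreeList();
    buf = get_buffer_from_freelist(freelist);       /* made-up helper */

    /* empty? steal from the other partitions before the clock sweep */
    for (int i = 0; buf == NULL && i < strategy_ncpus; i++)
        buf = get_buffer_from_freelist(&StrategyControl->freelists[i]);

    if (buf == NULL)
    {
        /* all freelists are empty, fall back to the clock sweep */
    }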
The other missing part is clocksweep - there's still just a single
instance of clocksweep, feeding buffers to all the freelists. But that's
clearly a problem, because the clocksweep returns buffers from all NUMA
nodes. The clocksweep really needs to be partitioned the same way as the
freelists, and each partition will operate on a subset of buffers (from
the right NUMA node).
I do have a separate experimental patch doing something like that; I
need to make it part of this branch.
5) v1-0005-NUMA-interleave-PGPROC-entries.patch
Another area that seems like it might benefit from NUMA is PGPROC, so I
gave it a try. It turned out somewhat challenging. Similarly to buffers
we have two pieces that need to be located in a coordinated way - PGPROC
entries and fast-path arrays. But we can't use the same approach as for
buffers/descriptors, because
(a) Neither of those pieces aligns with memory page size (PGPROC is
~900B, fast-path arrays are variable length).
(b) We could pad PGPROC entries e.g. to 1KB, but that'd still require
rather high max_connections before we use multiple huge pages.
The fast-path arrays are less of a problem, because those tend to be
larger, and are accessed through pointers, so we can just adjust that.
So what I did instead is split the whole PGPROC array into one array
per NUMA node, and one array for auxiliary processes and 2PC xacts. So
with 4 NUMA nodes there are 5 separate arrays, for example. Each array
is a multiple of memory pages, so we may waste some of the memory. But
that's simply how NUMA works - page granularity.
This however makes one particular thing harder - in a couple places we
accessed PGPROC entries through PROC_HDR->allProcs, which was pretty
much just one large array. And GetNumberFromPGProc() relied on array
arithmetics to determine procnumber. With the array partitioned, this
can't work the same way.
But there's a simple solution - if we turn allProcs into an array of
*pointers* to PGPROC arrays, there's no issue. All the places need a
pointer anyway. And then we need an explicit procnumber field in PGPROC,
instead of calculating it.
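Roughly this (a sketch of the change described above, not a verbatim
excerpt from the 0005 patch):

    /* allProcs becomes an array of pointers into the per-node arrays */
    PGPROC    **allProcs;

    /* lookup by procnumber stays a single dereference */
    #define GetPGProcByNumber(n)    (ProcGlobal->allProcs[(n)])

    /* the reverse direction reads the explicit field in PGPROC, */
    /* instead of doing pointer arithmetic on one big array      */
    #define GetNumberFromPGProc(proc)   ((proc)->procnumber)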
There's a chance this has a negative impact on code that accesses PGPROC
very often, but so far I haven't seen such cases. But if you can come up
with such examples, I'd like to see those.
There's another detail - when obtaining a PGPROC entry in InitProcess(),
we try to get an entry from the same NUMA node. And only if that doesn't
work, we grab the first one from the list (there's still just one PGPROC
freelist, I haven't split that - maybe we should?).
This has a GUC "numa_procs_interleave", again "off" by default. It's not
quite correct, though, because the partitioning happens always. It only
affects the PGPROC lookup. (In a way, this may be a bit broken.)
6) v1-0006-NUMA-pin-backends-to-NUMA-nodes.patch
This is an experimental patch, that simply pins the new process to the
NUMA node obtained from the freelist.
Driven by GUC "numa_procs_pin" (default: off).
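With libnuma the pinning itself is a one-liner, roughly like this (sketch,
not necessarily exactly what the patch does - the numa_node field is made
up, it'd be whatever node the PGPROC entry was taken from):

    #ifdef USE_LIBNUMA
        /* restrict the new backend to CPUs of its NUMA node */
        if (numa_procs_pin && MyProc->numa_node >= 0)
        {
            if (numa_run_on_node(MyProc->numa_node) != 0)
                elog(WARNING, "numa_run_on_node failed: %m");
        }
    #endif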
Summary
-------
So this is what I have at the moment. I've tried to organize the patches
in the order of importance, but that's just my guess. It's entirely
possible there's something I missed, some other order might make more
sense, etc.
There's also the question how this is related to other patches affecting
shared memory - I think the most relevant one is the "shared buffers
online resize" by Ashutosh, simply because it touches the shared memory.
I think the splitting would actually make some things simpler, or maybe
more flexible - in particular, it'd allow us to enable huge pages
only for some regions (like shared buffers), and keep the small pages
e.g. for PGPROC. So that'd be good.
But there'd also need to be some logic to "rework" how shared buffers
get mapped to NUMA nodes after resizing. It'd be silly to start with
memory on 4 nodes (25% each), resize shared buffers to 50% and end up
with memory only on 2 of the nodes (because the other 2 nodes were
originally assigned the upper half of shared buffers).
I don't have a clear idea how this would be done, but I guess it'd
require a bit of code invoked sometime after the resize. It'd already
need to rebuild the freelists in some way, I guess.
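One possible shape, reusing the helpers from 0001 (just a sketch - both
functions are static in buf_init.c, so this is purely illustrative):

    /* after resizing NBuffers, redo the chunk-to-node mapping ... */
    numa_chunk_buffers = choose_chunk_buffers(NBuffers, mem_page_size,
                                              numa_nodes);

    pg_numa_interleave_memory(BufferBlocks,
                              BufferBlocks + (Size) NBuffers * BLCKSZ,
                              mem_page_size,
                              numa_chunk_buffers * BLCKSZ,
                              numa_nodes);

    /* ... the same for BufferDescriptors, then rebuild the freelists */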
The other thing I haven't thought about very much is determining on
which CPUs/nodes the instance is allowed to run on. I assume we'd start
by simply inheriting/determining that at the start through libnuma, not
through some custom PG configuration (which the patch in [2] proposed to do).
regards
[1]: https://www.youtube.com/watch?v=V75KpACdl6E
[2]: /messages/by-id/CAKZiRmw6i1W1AwXxa-Asrn8wrVcVH3TO715g_MCoowTS9rkGyw@mail.gmail.com
--
Tomas Vondra
Attachments:
v1-0001-NUMA-interleaving-buffers.patch (text/x-patch)
From 9712e50d6d15c18ea2c5fcf457972486b0d4ef53 Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tomas@vondra.me>
Date: Tue, 6 May 2025 21:12:21 +0200
Subject: [PATCH v1 1/6] NUMA: interleaving buffers
Ensure shared buffers are allocated from all NUMA nodes, in a balanced
way, instead of just using the node where Postgres initially starts, or
where the kernel decides to migrate the page, etc. With pre-warming
performed by a single backend, this can easily result in severely
unbalanced memory distribution (with most of it on a single NUMA node).
The kernel would eventually move some of the memory to other nodes
(thanks to zone_reclaim), but that tends to take a long time. So this
patch improves predictability, reduces the time needed for warmup
during benchmarking, etc. It's less dependent on what the CPU
scheduler does, etc.
Furthermore, the buffers are mapped to NUMA nodes in a deterministic
way, so this also allows further improvements like backends using
buffers from the same NUMA node.
The effect is similar to
numactl --interleave=all
but there's a number of important differences.
Firstly, it's applied only to shared buffers (and also to descriptors),
not to the whole shared memory segment. It's not clear we'd want to use
interleaving for all parts, storing entries with different sizes and
life cycles (e.g. ProcArray may need a different approach).
Secondly, it considers the page and block size, and makes sure not to
split a buffer on different NUMA nodes (which with the regular
interleaving is guaranteed to happen, unless using huge pages). The
patch performs "explicit" interleaving, so that buffers are not split
like this.
The patch maps both buffers and buffer descriptors, so that the buffer
and its buffer descriptor end up on the same NUMA node.
The mapping happens in larger chunks (see choose_chunk_items). This is
required to handle buffer descriptors (which are smaller than buffers),
and it should also help to reduce the number of mappings. Most NUMA
systems will use 1GB chunks, unless using very small shared buffers.
Notes:
* The feature is enabled by numa_buffers_interleave GUC (false by default)
* It's not clear we want to enable interleaving for all shared memory.
We probably want that for shared buffers, but maybe not for ProcArray
or freelists.
* Similar questions are about huge pages - in general it's a good idea,
but maybe it's not quite good for ProcArray. It's somewhat separate
from NUMA, but not entirely because NUMA works on page granularity.
PGPROC entries are ~8KB, so too large for interleaving with 4K pages,
as we don't want to split the entry to multiple nodes. But could be
done explicitly, by specifying which node to use for the pages.
* We could partition ProcArray, with one partition per NUMA node, and
then at connection time pick a node from the same node. The process
could migrate to some other node later, especially for long-lived
connections, but there's no perfect solution. Maybe we could set
affinity to cores from the same node, or something like that?
---
src/backend/storage/buffer/buf_init.c | 384 +++++++++++++++++++++++++-
src/backend/storage/buffer/bufmgr.c | 1 +
src/backend/utils/init/globals.c | 3 +
src/backend/utils/misc/guc_tables.c | 10 +
src/bin/pgbench/pgbench.c | 67 ++---
src/include/miscadmin.h | 2 +
src/include/storage/bufmgr.h | 1 +
7 files changed, 427 insertions(+), 41 deletions(-)
diff --git a/src/backend/storage/buffer/buf_init.c b/src/backend/storage/buffer/buf_init.c
index ed1dc488a42..2ad34624c49 100644
--- a/src/backend/storage/buffer/buf_init.c
+++ b/src/backend/storage/buffer/buf_init.c
@@ -14,9 +14,17 @@
*/
#include "postgres.h"
+#ifdef USE_LIBNUMA
+#include <numa.h>
+#include <numaif.h>
+#endif
+
+#include "port/pg_numa.h"
#include "storage/aio.h"
#include "storage/buf_internals.h"
#include "storage/bufmgr.h"
+#include "storage/pg_shmem.h"
+#include "storage/proc.h"
BufferDescPadded *BufferDescriptors;
char *BufferBlocks;
@@ -25,6 +33,19 @@ WritebackContext BackendWritebackContext;
CkptSortItem *CkptBufferIds;
+static Size get_memory_page_size(void);
+static int64 choose_chunk_buffers(int NBuffers, Size mem_page_size, int num_nodes);
+static void pg_numa_interleave_memory(char *startptr, char *endptr,
+ Size mem_page_size, Size chunk_size,
+ int num_nodes);
+
+/* number of buffers allocated on the same NUMA node */
+static int64 numa_chunk_buffers = -1;
+
+/* number of NUMA nodes (as returned by numa_num_configured_nodes) */
+static int numa_nodes = -1;
+
+
/*
* Data Structures:
* buffers live in a freelist and a lookup data structure.
@@ -71,18 +92,80 @@ BufferManagerShmemInit(void)
foundDescs,
foundIOCV,
foundBufCkpt;
+ Size mem_page_size;
+ Size buffer_align;
+
+ /*
+ * XXX A bit weird. Do we need to worry about postmaster? Could this even
+ * run outside postmaster? I don't think so.
+ *
+ * XXX Another issue is we may get different values than when sizing the
+ * memory, because at that point we didn't know if we get huge pages,
+ * so we assumed we will. Shouldn't cause crashes, but we might allocate
+ * shared memory and then not use some of it (because of the alignment
+ * that we don't actually need). Not sure about better way, good for now.
+ */
+ if (IsUnderPostmaster)
+ mem_page_size = pg_get_shmem_pagesize();
+ else
+ mem_page_size = get_memory_page_size();
+
+ /*
+ * With NUMA we need to ensure the buffers are properly aligned not just
+ * to PG_IO_ALIGN_SIZE, but also to memory page size, because NUMA works
+ * on page granularity, and we don't want a buffer to get split to
+ * multiple nodes (when using multiple memory pages).
+ *
+ * We also don't want to interfere with other parts of shared memory,
+ * which could easily happen with huge pages (e.g. with data stored before
+ * buffers).
+ *
+ * We do this by aligning to the larger of the two values (we know both
+ * are power-of-two values, so the larger value is automatically a
+ * multiple of the lesser one).
+ *
+ * XXX Maybe there's a way to use less alignment?
+ *
+ * XXX Maybe with (mem_page_size > PG_IO_ALIGN_SIZE), we don't need to
+ * align to mem_page_size? Especially for very large huge pages (e.g. 1GB)
+ * that doesn't seem quite worth it. Maybe we should simply align to
+ * BLCKSZ, so that buffers don't get split? Still, we might interfere with
+ * other stuff stored in shared memory that we want to allocate on a
+ * particular NUMA node (e.g. ProcArray).
+ *
+ * XXX Maybe with "too large" huge pages we should just not do this, or
+ * maybe do this only for sufficiently large areas (e.g. shared buffers,
+ * but not ProcArray).
+ */
+ buffer_align = Max(mem_page_size, PG_IO_ALIGN_SIZE);
+
+ /* one page is a multiple of the other */
+ Assert(((mem_page_size % PG_IO_ALIGN_SIZE) == 0) ||
+ ((PG_IO_ALIGN_SIZE % mem_page_size) == 0));
- /* Align descriptors to a cacheline boundary. */
+ /*
+ * Align descriptors to a cacheline boundary, and memory page.
+ *
+ * We want to distribute both to NUMA nodes, so that each buffer and its
+ * descriptor are on the same NUMA node. So we align both the same way.
+ *
+ * XXX The memory page is always larger than cacheline, so the cacheline
+ * reference is a bit unnecessary.
+ *
+ * XXX In principle we only need to do this with NUMA, otherwise we could
+ * still align just to cacheline, as before.
+ */
BufferDescriptors = (BufferDescPadded *)
- ShmemInitStruct("Buffer Descriptors",
- NBuffers * sizeof(BufferDescPadded),
- &foundDescs);
+ TYPEALIGN(buffer_align,
+ ShmemInitStruct("Buffer Descriptors",
+ NBuffers * sizeof(BufferDescPadded) + buffer_align,
+ &foundDescs));
/* Align buffer pool on IO page size boundary. */
BufferBlocks = (char *)
- TYPEALIGN(PG_IO_ALIGN_SIZE,
+ TYPEALIGN(buffer_align,
ShmemInitStruct("Buffer Blocks",
- NBuffers * (Size) BLCKSZ + PG_IO_ALIGN_SIZE,
+ NBuffers * (Size) BLCKSZ + buffer_align,
&foundBufs));
/* Align condition variables to cacheline boundary. */
@@ -112,6 +195,63 @@ BufferManagerShmemInit(void)
{
int i;
+ /*
+ * Assign chunks of buffers and buffer descriptors to the available
+ * NUMA nodes. We can't use the regular interleaving, because with
+ * regular memory pages (smaller than BLCKSZ) we'd split all buffers
+ * to multiple NUMA nodes. And we don't want that.
+ *
+ * But even with huge pages it seems like a good idea to not have
+ * mapping for each page.
+ *
+ * So we always assign a larger contiguous chunk of buffers to the
+ * same NUMA node, as calculated by choose_chunk_buffers(). We try to
+ * keep the chunks large enough to work both for buffers and buffer
+ * descriptors, but not too large. See the comments at
+ * choose_chunk_buffers() for details.
+ *
+ * Thanks to the earlier alignment (to memory page etc.), we know the
+ * buffers won't get split, etc.
+ *
+ * This also makes it easier / straightforward to calculate which NUMA
+ * node a buffer belongs to (it's a matter of divide + mod). See
+ * BufferGetNode().
+ */
+ if (numa_buffers_interleave)
+ {
+ char *startptr,
+ *endptr;
+ Size chunk_size;
+
+ numa_nodes = numa_num_configured_nodes();
+
+ numa_chunk_buffers
+ = choose_chunk_buffers(NBuffers, mem_page_size, numa_nodes);
+
+ elog(LOG, "BufferManagerShmemInit num_nodes %d chunk_buffers %ld",
+ numa_nodes, numa_chunk_buffers);
+
+ /* first map buffers */
+ startptr = BufferBlocks;
+ endptr = startptr + ((Size) NBuffers) * BLCKSZ;
+ chunk_size = (numa_chunk_buffers * BLCKSZ);
+
+ pg_numa_interleave_memory(startptr, endptr,
+ mem_page_size,
+ chunk_size,
+ numa_nodes);
+
+ /* now do the same for buffer descriptors */
+ startptr = (char *) BufferDescriptors;
+ endptr = startptr + ((Size) NBuffers) * sizeof(BufferDescPadded);
+ chunk_size = (numa_chunk_buffers * sizeof(BufferDescPadded));
+
+ pg_numa_interleave_memory(startptr, endptr,
+ mem_page_size,
+ chunk_size,
+ numa_nodes);
+ }
+
/*
* Initialize all the buffer headers.
*/
@@ -144,6 +284,11 @@ BufferManagerShmemInit(void)
GetBufferDescriptor(NBuffers - 1)->freeNext = FREENEXT_END_OF_LIST;
}
+ /*
+ * At this point we have all the buffers in a single long freelist. With
+ * freelist partitioning we rebuild them in StrategyInitialize.
+ */
+
/* Init other shared buffer-management stuff */
StrategyInitialize(!foundDescs);
@@ -152,24 +297,72 @@ BufferManagerShmemInit(void)
&backend_flush_after);
}
+/*
+ * Determine the size of memory page.
+ *
+ * XXX This is a bit tricky, because the result depends at which point we call
+ * this. Before the allocation we don't know if we succeed in allocating huge
+ * pages - but we have to size everything for the chance that we will. And then
+ * if the huge pages fail (with 'huge_pages=try'), we'll use the regular memory
+ * pages. But at that point we can't adjust the sizing.
+ *
+ * XXX Maybe with huge_pages=try we should do the sizing twice - first with
+ * huge pages, and if that fails, then without them. But not for this patch.
+ * Up to this point there was no such dependency on huge pages.
+ */
+static Size
+get_memory_page_size(void)
+{
+ Size os_page_size;
+ Size huge_page_size;
+
+#ifdef WIN32
+ SYSTEM_INFO sysinfo;
+
+ GetSystemInfo(&sysinfo);
+ os_page_size = sysinfo.dwPageSize;
+#else
+ os_page_size = sysconf(_SC_PAGESIZE);
+#endif
+
+ /* assume huge pages get used, unless HUGE_PAGES_OFF */
+ if (huge_pages_status != HUGE_PAGES_OFF)
+ GetHugePageSize(&huge_page_size, NULL);
+ else
+ huge_page_size = 0;
+
+ return Max(os_page_size, huge_page_size);
+}
+
/*
* BufferManagerShmemSize
*
* compute the size of shared memory for the buffer pool including
* data pages, buffer descriptors, hash tables, etc.
+ *
+ * XXX Called before allocation, so we don't know if huge pages get used yet.
+ * So we need to assume huge pages get used, and use get_memory_page_size()
+ * to calculate the largest possible memory page.
*/
Size
BufferManagerShmemSize(void)
{
Size size = 0;
+ Size mem_page_size;
+
+ /* XXX why does IsUnderPostmaster matter? */
+ if (IsUnderPostmaster)
+ mem_page_size = pg_get_shmem_pagesize();
+ else
+ mem_page_size = get_memory_page_size();
/* size of buffer descriptors */
size = add_size(size, mul_size(NBuffers, sizeof(BufferDescPadded)));
/* to allow aligning buffer descriptors */
- size = add_size(size, PG_CACHE_LINE_SIZE);
+ size = add_size(size, Max(mem_page_size, PG_IO_ALIGN_SIZE));
/* size of data pages, plus alignment padding */
- size = add_size(size, PG_IO_ALIGN_SIZE);
+ size = add_size(size, Max(mem_page_size, PG_IO_ALIGN_SIZE));
size = add_size(size, mul_size(NBuffers, BLCKSZ));
/* size of stuff controlled by freelist.c */
@@ -186,3 +379,178 @@ BufferManagerShmemSize(void)
return size;
}
+
+/*
+ * choose_chunk_buffers
+ * choose the number of buffers allocated to a NUMA node at once
+ *
+ * We don't map shared buffers to NUMA nodes one by one, but in larger chunks.
+ * This is both for efficiency reasons (fewer mappings), and also because we
+ * want to map buffer descriptors too - and descriptors are much smaller. So
+ * we pick a number that's high enough for descriptors to use whole pages.
+ *
+ * We also want to keep buffers somehow evenly distributed on nodes, with
+ * about NBuffers/nodes per node. So we don't use chunks larger than this,
+ * to keep it as fair as possible (the chunk size is a possible difference
+ * between memory allocated to different NUMA nodes).
+ *
+ * It's possible shared buffers are so small this is not possible (i.e.
+ * it's less than chunk_size). But sensible NUMA systems will use a lot
+ * of memory, so this is unlikely.
+ *
+ * We simply print a warning about the misbalance, and that's it.
+ *
+ * XXX It'd be good to ensure the chunk size is a power-of-2, because then
+ * we could calculate the NUMA node simply by shift/modulo, while now we
+ * have to do a division. But we don't know how many buffers and buffer
+ * descriptors fits into a memory page. It may not be a power-of-2.
+ */
+static int64
+choose_chunk_buffers(int NBuffers, Size mem_page_size, int num_nodes)
+{
+ int64 num_items;
+ int64 max_items;
+
+ /* make sure the chunks will align nicely */
+ Assert(BLCKSZ % sizeof(BufferDescPadded) == 0);
+ Assert(mem_page_size % sizeof(BufferDescPadded) == 0);
+ Assert(((BLCKSZ % mem_page_size) == 0) || ((mem_page_size % BLCKSZ) == 0));
+
+ /*
+ * The minimum number of items to fill a memory page with descriptors and
+ * blocks. The NUMA allocates memory in pages, and we need to do that for
+ * both buffers and descriptors.
+ *
+ * In practice the BLCKSZ doesn't really matter, because it's much larger
+ * than BufferDescPadded, so the result is determined by the buffer descriptors.
+ * But it's clearer this way.
+ */
+ num_items = Max(mem_page_size / sizeof(BufferDescPadded),
+ mem_page_size / BLCKSZ);
+
+ /*
+ * We shouldn't use chunks larger than NBuffers/num_nodes, because with
+ * larger chunks the last NUMA node would end up with much less memory (or
+ * no memory at all).
+ */
+ max_items = (NBuffers / num_nodes);
+
+ /*
+ * Did we already exceed the maximum desirable chunk size? That is, will
+ * the last node get less than one whole chunk (or no memory at all)?
+ */
+ if (num_items > max_items)
+ elog(WARNING, "choose_chunk_buffers: chunk items exceeds max (%ld > %ld)",
+ num_items, max_items);
+
+ /* grow the chunk size until we hit the max limit. */
+ while (2 * num_items <= max_items)
+ num_items *= 2;
+
+ /*
+ * XXX It's not difficult to construct cases where we end up with not
+ * quite balanced distribution. For example, with shared_buffers=10GB and
+ * 4 NUMA nodes, we end up with 2GB chunks, which means the first node
+ * gets 4GB, and the three other nodes get 2GB each.
+ *
+ * We could be smarter, and try to get more balanced distribution. We
+ * could simply reduce max_items e.g. to
+ *
+ * max_items = (NBuffers / num_nodes) / 4;
+ *
+ * in which cases we'd end up with 512MB chunks, and each nodes would get
+ * the same 2.5GB chunk. It may not always work out this nicely, but it's
+ * better than with (NBuffers / num_nodes).
+ *
+ * Alternatively, we could "backtrack" - try with the large max_items,
+ * check how balanced it is, and if it's too imbalanced, try with a
+ * smaller one.
+ *
+ * We however want a simple scheme.
+ */
+
+ return num_items;
+}
+
+/*
+ * Calculate the NUMA node for a given buffer.
+ */
+int
+BufferGetNode(Buffer buffer)
+{
+ /* not NUMA interleaving */
+ if (numa_chunk_buffers == -1)
+ return -1;
+
+ return (buffer / numa_chunk_buffers) % numa_nodes;
+}
+
+/*
+ * pg_numa_interleave_memory
+ * move memory to different NUMA nodes in larger chunks
+ *
+ * startptr - start of the region (should be aligned to page size)
+ * endptr - end of the region (doesn't need to be aligned)
+ * mem_page_size - size of the memory page size
+ * chunk_size - size of the chunk to move to a single node (should be multiple
+ * of page size)
+ * num_nodes - number of nodes to allocate memory to
+ *
+ * XXX Maybe this should use numa_tonode_memory and numa_police_memory instead?
+ * That might be more efficient than numa_move_pages, as it works on larger
+ * chunks of memory, not individual system pages, I think.
+ *
+ * XXX The "interleave" name is not quite accurate, I guess.
+ */
+static void
+pg_numa_interleave_memory(char *startptr, char *endptr,
+ Size mem_page_size, Size chunk_size,
+ int num_nodes)
+{
+ volatile uint64 touch pg_attribute_unused();
+ char *ptr = startptr;
+
+ /* chunk size has to be a multiple of memory page */
+ Assert((chunk_size % mem_page_size) == 0);
+
+ /*
+ * Walk the memory pages in the range, and determine the node for each
+ * one. We use numa_tonode_memory(), because then we can move a whole
+ * memory range to the node, we don't need to worry about individual pages
+ * like with numa_move_pages().
+ */
+ while (ptr < endptr)
+ {
+ /* We may have an incomplete chunk at the end. */
+ Size sz = Min(chunk_size, (endptr - ptr));
+
+ /*
+ * What NUMA node does this range belong to? Each chunk should go to
+ * the same NUMA node, in a round-robin manner.
+ */
+ int node = ((ptr - startptr) / chunk_size) % num_nodes;
+
+ /*
+ * Nope, we have the first buffer from the next memory page, and we'll
+ * set NUMA node for it (and all pages up to the next buffer). The
+ * buffer should align with the memory page, thanks to the
+ * buffer_align earlier.
+ */
+ Assert((int64) ptr % mem_page_size == 0);
+ Assert((sz % mem_page_size) == 0);
+
+ /*
+ * XXX no return value, to make this fail on error, has to use
+ * numa_set_strict
+ *
+ * XXX Should we still touch the memory first, like with numa_move_pages,
+ * or is that not necessary?
+ */
+ numa_tonode_memory(ptr, sz, node);
+
+ ptr += sz;
+ }
+
+ /* should have processed all chunks */
+ Assert(ptr == endptr);
+}
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 406ce77693c..e1e1cfd379d 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -685,6 +685,7 @@ ReadRecentBuffer(RelFileLocator rlocator, ForkNumber forkNum, BlockNumber blockN
BufferDesc *bufHdr;
BufferTag tag;
uint32 buf_state;
+
Assert(BufferIsValid(recent_buffer));
ResourceOwnerEnlarge(CurrentResourceOwner);
diff --git a/src/backend/utils/init/globals.c b/src/backend/utils/init/globals.c
index d31cb45a058..876cb64cf66 100644
--- a/src/backend/utils/init/globals.c
+++ b/src/backend/utils/init/globals.c
@@ -145,6 +145,9 @@ int max_worker_processes = 8;
int max_parallel_workers = 8;
int MaxBackends = 0;
+/* NUMA stuff */
+bool numa_buffers_interleave = false;
+
/* GUC parameters for vacuum */
int VacuumBufferUsageLimit = 2048;
diff --git a/src/backend/utils/misc/guc_tables.c b/src/backend/utils/misc/guc_tables.c
index 511dc32d519..198a57e70a5 100644
--- a/src/backend/utils/misc/guc_tables.c
+++ b/src/backend/utils/misc/guc_tables.c
@@ -2116,6 +2116,16 @@ struct config_bool ConfigureNamesBool[] =
NULL, NULL, NULL
},
+ {
+ {"numa_buffers_interleave", PGC_POSTMASTER, DEVELOPER_OPTIONS,
+ gettext_noop("Enables NUMA interleaving of shared buffers."),
+ gettext_noop("When enabled, the buffers in shared memory are interleaved to all NUMA nodes."),
+ },
+ &numa_buffers_interleave,
+ false,
+ NULL, NULL, NULL
+ },
+
{
{"sync_replication_slots", PGC_SIGHUP, REPLICATION_STANDBY,
gettext_noop("Enables a physical standby to synchronize logical failover replication slots from the primary server."),
diff --git a/src/bin/pgbench/pgbench.c b/src/bin/pgbench/pgbench.c
index 69b6a877dc9..c07de903f76 100644
--- a/src/bin/pgbench/pgbench.c
+++ b/src/bin/pgbench/pgbench.c
@@ -305,7 +305,7 @@ static const char *progname;
#define CPU_PINNING_RANDOM 1
#define CPU_PINNING_COLOCATED 2
-static int pinning_mode = CPU_PINNING_NONE;
+static int pinning_mode = CPU_PINNING_NONE;
#define WSEP '@' /* weight separator */
@@ -874,20 +874,20 @@ static bool socket_has_input(socket_set *sa, int fd, int idx);
*/
typedef struct cpu_generator_state
{
- int ncpus; /* number of CPUs available */
- int nitems; /* number of items in the queue */
- int *nthreads; /* number of threads for each CPU */
- int *nclients; /* number of processes for each CPU */
- int *items; /* queue of CPUs to pick from */
-} cpu_generator_state;
+ int ncpus; /* number of CPUs available */
+ int nitems; /* number of items in the queue */
+ int *nthreads; /* number of threads for each CPU */
+ int *nclients; /* number of processes for each CPU */
+ int *items; /* queue of CPUs to pick from */
+} cpu_generator_state;
static cpu_generator_state cpu_generator_init(int ncpus);
-static void cpu_generator_refill(cpu_generator_state *state);
-static void cpu_generator_reset(cpu_generator_state *state);
-static int cpu_generator_thread(cpu_generator_state *state);
-static int cpu_generator_client(cpu_generator_state *state, int thread_cpu);
-static void cpu_generator_print(cpu_generator_state *state);
-static bool cpu_generator_check(cpu_generator_state *state);
+static void cpu_generator_refill(cpu_generator_state * state);
+static void cpu_generator_reset(cpu_generator_state * state);
+static int cpu_generator_thread(cpu_generator_state * state);
+static int cpu_generator_client(cpu_generator_state * state, int thread_cpu);
+static void cpu_generator_print(cpu_generator_state * state);
+static bool cpu_generator_check(cpu_generator_state * state);
static void reset_pinning(TState *threads, int nthreads);
@@ -7422,7 +7422,7 @@ main(int argc, char **argv)
/* try to assign threads/clients to CPUs */
if (pinning_mode != CPU_PINNING_NONE)
{
- int nprocs = get_nprocs();
+ int nprocs = get_nprocs();
cpu_generator_state state = cpu_generator_init(nprocs);
retry:
@@ -7433,6 +7433,7 @@ retry:
for (i = 0; i < nthreads; i++)
{
TState *thread = &threads[i];
+
thread->cpu = cpu_generator_thread(&state);
}
@@ -7444,7 +7445,7 @@ retry:
while (true)
{
/* did we find any unassigned backend? */
- bool found = false;
+ bool found = false;
for (i = 0; i < nthreads; i++)
{
@@ -7678,10 +7679,10 @@ threadRun(void *arg)
/* determine PID of the backend, pin it to the same CPU */
for (int i = 0; i < nstate; i++)
{
- char *pid_str;
- pid_t pid;
+ char *pid_str;
+ pid_t pid;
- PGresult *res = PQexec(state[i].con, "select pg_backend_pid()");
+ PGresult *res = PQexec(state[i].con, "select pg_backend_pid()");
if (PQresultStatus(res) != PGRES_TUPLES_OK)
pg_fatal("could not determine PID of the backend for client %d",
@@ -8184,7 +8185,7 @@ cpu_generator_init(int ncpus)
{
struct timeval tv;
- cpu_generator_state state;
+ cpu_generator_state state;
state.ncpus = ncpus;
@@ -8207,7 +8208,7 @@ cpu_generator_init(int ncpus)
}
static void
-cpu_generator_refill(cpu_generator_state *state)
+cpu_generator_refill(cpu_generator_state * state)
{
struct timeval tv;
@@ -8223,7 +8224,7 @@ cpu_generator_refill(cpu_generator_state *state)
}
static void
-cpu_generator_reset(cpu_generator_state *state)
+cpu_generator_reset(cpu_generator_state * state)
{
state->nitems = 0;
cpu_generator_refill(state);
@@ -8236,15 +8237,15 @@ cpu_generator_reset(cpu_generator_state *state)
}
static int
-cpu_generator_thread(cpu_generator_state *state)
+cpu_generator_thread(cpu_generator_state * state)
{
if (state->nitems == 0)
cpu_generator_refill(state);
while (true)
{
- int idx = lrand48() % state->nitems;
- int cpu = state->items[idx];
+ int idx = lrand48() % state->nitems;
+ int cpu = state->items[idx];
state->items[idx] = state->items[state->nitems - 1];
state->nitems--;
@@ -8256,10 +8257,10 @@ cpu_generator_thread(cpu_generator_state *state)
}
static int
-cpu_generator_client(cpu_generator_state *state, int thread_cpu)
+cpu_generator_client(cpu_generator_state * state, int thread_cpu)
{
- int min_clients;
- bool has_valid_cpus = false;
+ int min_clients;
+ bool has_valid_cpus = false;
for (int i = 0; i < state->nitems; i++)
{
@@ -8284,8 +8285,8 @@ cpu_generator_client(cpu_generator_state *state, int thread_cpu)
while (true)
{
- int idx = lrand48() % state->nitems;
- int cpu = state->items[idx];
+ int idx = lrand48() % state->nitems;
+ int cpu = state->items[idx];
if (cpu == thread_cpu)
continue;
@@ -8303,7 +8304,7 @@ cpu_generator_client(cpu_generator_state *state, int thread_cpu)
}
static void
-cpu_generator_print(cpu_generator_state *state)
+cpu_generator_print(cpu_generator_state * state)
{
for (int i = 0; i < state->ncpus; i++)
{
@@ -8312,10 +8313,10 @@ cpu_generator_print(cpu_generator_state *state)
}
static bool
-cpu_generator_check(cpu_generator_state *state)
+cpu_generator_check(cpu_generator_state * state)
{
- int min_count = INT_MAX,
- max_count = 0;
+ int min_count = INT_MAX,
+ max_count = 0;
for (int i = 0; i < state->ncpus; i++)
{
diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h
index 1bef98471c3..014a6079af2 100644
--- a/src/include/miscadmin.h
+++ b/src/include/miscadmin.h
@@ -178,6 +178,8 @@ extern PGDLLIMPORT int MaxConnections;
extern PGDLLIMPORT int max_worker_processes;
extern PGDLLIMPORT int max_parallel_workers;
+extern PGDLLIMPORT bool numa_buffers_interleave;
+
extern PGDLLIMPORT int commit_timestamp_buffers;
extern PGDLLIMPORT int multixact_member_buffers;
extern PGDLLIMPORT int multixact_offset_buffers;
diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index 41fdc1e7693..c257c8a1c20 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -319,6 +319,7 @@ extern void EvictRelUnpinnedBuffers(Relation rel,
/* in buf_init.c */
extern void BufferManagerShmemInit(void);
extern Size BufferManagerShmemSize(void);
+extern int BufferGetNode(Buffer buffer);
/* in localbuf.c */
extern void AtProcExit_LocalBuffers(void);
--
2.49.0
v1-0002-NUMA-localalloc.patch (text/x-patch)
From 6919b1c1c59a6084017ebae5a884bb6c60639364 Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tomas@vondra.me>
Date: Thu, 22 May 2025 18:27:06 +0200
Subject: [PATCH v1 2/6] NUMA: localalloc
Set the default allocation policy to "localalloc", which means from the
local NUMA node. This is useful for process-private memory, which is not
going to be shared with other nodes, and is relatively short-lived (so
we're unlikely to have issues if the process gets moved by scheduler).
This sets default for the whole process, for all future allocations. But
that's fine, we've already populated the shared memory earlier (by
interleaving it explicitly). Otherwise we'd trigger a page fault and it'd
be allocated on the local node.
XXX This patch may not be necessary, as we now locate memory to nodes
using explicit numa_tonode_memory() calls, and not by interleaving. But
it's useful for experiments during development, so I'm keeping it.
---
src/backend/utils/init/globals.c | 1 +
src/backend/utils/init/miscinit.c | 16 ++++++++++++++++
src/backend/utils/misc/guc_tables.c | 10 ++++++++++
src/include/miscadmin.h | 1 +
4 files changed, 28 insertions(+)
diff --git a/src/backend/utils/init/globals.c b/src/backend/utils/init/globals.c
index 876cb64cf66..f5359db3656 100644
--- a/src/backend/utils/init/globals.c
+++ b/src/backend/utils/init/globals.c
@@ -147,6 +147,7 @@ int MaxBackends = 0;
/* NUMA stuff */
bool numa_buffers_interleave = false;
+bool numa_localalloc = false;
/* GUC parameters for vacuum */
int VacuumBufferUsageLimit = 2048;
diff --git a/src/backend/utils/init/miscinit.c b/src/backend/utils/init/miscinit.c
index 43b4dbccc3d..d11936691b2 100644
--- a/src/backend/utils/init/miscinit.c
+++ b/src/backend/utils/init/miscinit.c
@@ -28,6 +28,10 @@
#include <arpa/inet.h>
#include <utime.h>
+#ifdef USE_LIBNUMA
+#include <numa.h>
+#endif
+
#include "access/htup_details.h"
#include "access/parallel.h"
#include "catalog/pg_authid.h"
@@ -164,6 +168,18 @@ InitPostmasterChild(void)
(errcode_for_socket_access(),
errmsg_internal("could not set postmaster death monitoring pipe to FD_CLOEXEC mode: %m")));
#endif
+
+#ifdef USE_LIBNUMA
+ /*
+ * Set the default allocation policy to local node, where the task is
+ * executing at the time of a page fault.
+ *
+ * XXX I believe this is not necessary, now that we don't use automatic
+ * interleaving (numa_set_interleave_mask).
+ */
+ if (numa_localalloc)
+ numa_set_localalloc();
+#endif
}
/*
diff --git a/src/backend/utils/misc/guc_tables.c b/src/backend/utils/misc/guc_tables.c
index 198a57e70a5..57f2df7ab74 100644
--- a/src/backend/utils/misc/guc_tables.c
+++ b/src/backend/utils/misc/guc_tables.c
@@ -2126,6 +2126,16 @@ struct config_bool ConfigureNamesBool[] =
NULL, NULL, NULL
},
+ {
+ {"numa_localalloc", PGC_POSTMASTER, DEVELOPER_OPTIONS,
+ gettext_noop("Enables setting the default allocation policy to local node."),
+ gettext_noop("When enabled, allocate from the node where the task is executing."),
+ },
+ &numa_localalloc,
+ false,
+ NULL, NULL, NULL
+ },
+
{
{"sync_replication_slots", PGC_SIGHUP, REPLICATION_STANDBY,
gettext_noop("Enables a physical standby to synchronize logical failover replication slots from the primary server."),
diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h
index 014a6079af2..692871a401f 100644
--- a/src/include/miscadmin.h
+++ b/src/include/miscadmin.h
@@ -179,6 +179,7 @@ extern PGDLLIMPORT int max_worker_processes;
extern PGDLLIMPORT int max_parallel_workers;
extern PGDLLIMPORT bool numa_buffers_interleave;
+extern PGDLLIMPORT bool numa_localalloc;
extern PGDLLIMPORT int commit_timestamp_buffers;
extern PGDLLIMPORT int multixact_member_buffers;
--
2.49.0
v1-0003-freelist-Don-t-track-tail-of-a-freelist.patch (text/x-patch)
From c2b2edb71d629ebe4283b636f058b8e42d1f1a35 Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Mon, 14 Oct 2024 14:10:13 -0400
Subject: [PATCH v1 3/6] freelist: Don't track tail of a freelist
The freelist tail isn't currently used, making it unnecessary overhead.
So just don't do that.
---
src/backend/storage/buffer/freelist.c | 9 ---------
1 file changed, 9 deletions(-)
diff --git a/src/backend/storage/buffer/freelist.c b/src/backend/storage/buffer/freelist.c
index 01909be0272..e046526c149 100644
--- a/src/backend/storage/buffer/freelist.c
+++ b/src/backend/storage/buffer/freelist.c
@@ -40,12 +40,6 @@ typedef struct
pg_atomic_uint32 nextVictimBuffer;
int firstFreeBuffer; /* Head of list of unused buffers */
- int lastFreeBuffer; /* Tail of list of unused buffers */
-
- /*
- * NOTE: lastFreeBuffer is undefined when firstFreeBuffer is -1 (that is,
- * when the list is empty)
- */
/*
* Statistics. These counters should be wide enough that they can't
@@ -371,8 +365,6 @@ StrategyFreeBuffer(BufferDesc *buf)
if (buf->freeNext == FREENEXT_NOT_IN_LIST)
{
buf->freeNext = StrategyControl->firstFreeBuffer;
- if (buf->freeNext < 0)
- StrategyControl->lastFreeBuffer = buf->buf_id;
StrategyControl->firstFreeBuffer = buf->buf_id;
}
@@ -509,7 +501,6 @@ StrategyInitialize(bool init)
* assume it was previously set up by BufferManagerShmemInit().
*/
StrategyControl->firstFreeBuffer = 0;
- StrategyControl->lastFreeBuffer = NBuffers - 1;
/* Initialize the clock sweep pointer */
pg_atomic_init_u32(&StrategyControl->nextVictimBuffer, 0);
--
2.49.0
v1-0004-NUMA-partition-buffer-freelist.patch (text/x-patch)
From 6505848ac8359c8c76dfbffc7150b6601ab07601 Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tomas@vondra.me>
Date: Thu, 22 May 2025 18:38:41 +0200
Subject: [PATCH v1 4/6] NUMA: partition buffer freelist
Instead of a single buffer freelist, partition into multiple smaller
lists, to reduce lock contention, and to spread the buffers over all
NUMA nodes more evenly.
There are four strategies, specified by GUC numa_partition_freelist
* none - single long freelist, should work just like now
* node - one freelist per NUMA node, with only buffers from that node
* cpu - one freelist per CPU
* pid - freelist determined by PID (same number of freelists as 'cpu')
When allocating a buffer, it's taken from the correct freelist (e.g.
same NUMA node).
Note: This is (probably) more important than partitioning ProcArray.
---
src/backend/storage/buffer/buf_init.c | 4 +-
src/backend/storage/buffer/freelist.c | 324 +++++++++++++++++++++++---
src/backend/utils/init/globals.c | 1 +
src/backend/utils/misc/guc_tables.c | 18 ++
src/include/miscadmin.h | 1 +
src/include/storage/bufmgr.h | 8 +
6 files changed, 327 insertions(+), 29 deletions(-)
diff --git a/src/backend/storage/buffer/buf_init.c b/src/backend/storage/buffer/buf_init.c
index 2ad34624c49..920f1a32a8f 100644
--- a/src/backend/storage/buffer/buf_init.c
+++ b/src/backend/storage/buffer/buf_init.c
@@ -543,8 +543,8 @@ pg_numa_interleave_memory(char *startptr, char *endptr,
* XXX no return value, to make this fail on error, has to use
* numa_set_strict
*
- * XXX Should we still touch the memory first, like with numa_move_pages,
- * or is that not necessary?
+ * XXX Should we still touch the memory first, like with
+ * numa_move_pages, or is that not necessary?
*/
numa_tonode_memory(ptr, sz, node);
diff --git a/src/backend/storage/buffer/freelist.c b/src/backend/storage/buffer/freelist.c
index e046526c149..c93ec2841c5 100644
--- a/src/backend/storage/buffer/freelist.c
+++ b/src/backend/storage/buffer/freelist.c
@@ -15,14 +15,41 @@
*/
#include "postgres.h"
+#include <sched.h>
+#include <sys/sysinfo.h>
+
+#ifdef USE_LIBNUMA
+#include <numa.h>
+#include <numaif.h>
+#endif
+
#include "pgstat.h"
#include "port/atomics.h"
#include "storage/buf_internals.h"
#include "storage/bufmgr.h"
+#include "storage/ipc.h"
#include "storage/proc.h"
#define INT_ACCESS_ONCE(var) ((int)(*((volatile int *)&(var))))
+/*
+ * Represents one freelist partition.
+ */
+typedef struct BufferStrategyFreelist
+{
+ /* Spinlock: protects the values below */
+ slock_t freelist_lock;
+
+ /*
+ * XXX Not sure why this needs to be aligned like this. Need to ask
+ * Andres.
+ */
+ int firstFreeBuffer __attribute__((aligned(64))); /* Head of list of
+ * unused buffers */
+
+ /* Number of buffers consumed from this list. */
+ uint64 consumed;
+} BufferStrategyFreelist;
/*
* The shared freelist control information.
@@ -39,8 +66,6 @@ typedef struct
*/
pg_atomic_uint32 nextVictimBuffer;
- int firstFreeBuffer; /* Head of list of unused buffers */
-
/*
* Statistics. These counters should be wide enough that they can't
* overflow during a single bgwriter cycle.
@@ -51,13 +76,27 @@ typedef struct
/*
* Bgworker process to be notified upon activity or -1 if none. See
* StrategyNotifyBgWriter.
+ *
+ * XXX Not sure why this needs to be aligned like this. Need to ask
+ * Andres. Also, shouldn't the alignment be specified after, like for
+ * "consumed"?
*/
- int bgwprocno;
+ int __attribute__((aligned(64))) bgwprocno;
+
+ BufferStrategyFreelist freelists[FLEXIBLE_ARRAY_MEMBER];
} BufferStrategyControl;
/* Pointers to shared state */
static BufferStrategyControl *StrategyControl = NULL;
+/*
+ * XXX shouldn't this be in BufferStrategyControl? Probably not, we need to
+ * calculate it during sizing, and perhaps it could change before the memory
+ * gets allocated (so we need to remember the values).
+ */
+static int strategy_nnodes;
+static int strategy_ncpus;
+
/*
* Private (non-shared) state for managing a ring of shared buffers to re-use.
* This is currently the only kind of BufferAccessStrategy object, but someday
@@ -157,6 +196,90 @@ ClockSweepTick(void)
return victim;
}
+/*
+ * ChooseFreeList
+ * Pick the buffer freelist to use, depending on the CPU and NUMA node.
+ *
+ * Without partitioned freelists (numa_partition_freelist=false), there's only
+ * a single freelist, so use that.
+ *
+ * With partitioned freelists, we have multiple ways how to pick the freelist
+ * for the backend:
+ *
+ * - one freelist per CPU, use the freelist for CPU the task executes on
+ *
+ * - one freelist per NUMA node, use the freelist for node task executes on
+ *
+ * - use fixed number of freelists, map processes to lists based on PID
+ *
+ * There may be some other strategies, not sure. The important thing is this
+ * needs to be reflected during initialization, i.e. we need to create the
+ * right number of lists.
+ */
+static BufferStrategyFreelist *
+ChooseFreeList(void)
+{
+ unsigned cpu;
+ unsigned node;
+ int rc;
+
+ int freelist_idx;
+
+ /* freelist not partitioned, return the first (and only) freelist */
+ if (numa_partition_freelist == FREELIST_PARTITION_NONE)
+ return &StrategyControl->freelists[0];
+
+ /*
+ * freelist is partitioned, so determine the CPU/NUMA node, and pick a
+ * list based on that.
+ */
+ rc = getcpu(&cpu, &node);
+ if (rc != 0)
+ elog(ERROR, "getcpu failed: %m");
+
+ /*
+ * FIXME This doesn't work well if CPUs are excluded from being run or
+ * offline. In that case we end up not using some freelists at all, but
+ * not sure if we need to worry about that. Probably not for now. But
+ * could that change while the system is running?
+ *
+ * XXX Maybe we should somehow detect changes to the list of CPUs, and
+ * rebuild the lists if that changes? But that seems expensive.
+ */
+ if (cpu > strategy_ncpus)
+ elog(ERROR, "cpu out of range: %d > %u", cpu, strategy_ncpus);
+ else if (node > strategy_nnodes)
+ elog(ERROR, "node out of range: %d > %u", cpu, strategy_nnodes);
+
+ /*
+ * Pick the freelist, based on CPU, NUMA node or process PID. This matches
+ * how we built the freelists above.
+ *
+ * XXX Can we rely on some of the values (especially strategy_nnodes) to
+ * be a power-of-2? Then we could replace the modulo with a mask, which is
+ * likely more efficient.
+ */
+ switch (numa_partition_freelist)
+ {
+ case FREELIST_PARTITION_CPU:
+ freelist_idx = cpu % strategy_ncpus;
+ break;
+
+ case FREELIST_PARTITION_NODE:
+ freelist_idx = node % strategy_nnodes;
+ break;
+
+ case FREELIST_PARTITION_PID:
+ freelist_idx = MyProcPid % strategy_ncpus;
+ break;
+
+ default:
+ elog(ERROR, "unknown freelist partitioning value");
+ }
+
+ return &StrategyControl->freelists[freelist_idx];
+}
+
/*
* have_free_buffer -- a lockless check to see if there is a free buffer in
* buffer pool.
@@ -168,10 +291,13 @@ ClockSweepTick(void)
bool
have_free_buffer(void)
{
- if (StrategyControl->firstFreeBuffer >= 0)
- return true;
- else
- return false;
+ for (int i = 0; i < strategy_ncpus; i++)
+ {
+ if (StrategyControl->freelists[i].firstFreeBuffer >= 0)
+ return true;
+ }
+
+ return false;
}
/*
@@ -193,6 +319,7 @@ StrategyGetBuffer(BufferAccessStrategy strategy, uint32 *buf_state, bool *from_r
int bgwprocno;
int trycounter;
uint32 local_buf_state; /* to avoid repeated (de-)referencing */
+ BufferStrategyFreelist *freelist;
*from_ring = false;
@@ -259,31 +386,35 @@ StrategyGetBuffer(BufferAccessStrategy strategy, uint32 *buf_state, bool *from_r
* buffer_strategy_lock not the individual buffer spinlocks, so it's OK to
* manipulate them without holding the spinlock.
*/
- if (StrategyControl->firstFreeBuffer >= 0)
+ freelist = ChooseFreeList();
+ if (freelist->firstFreeBuffer >= 0)
{
while (true)
{
/* Acquire the spinlock to remove element from the freelist */
- SpinLockAcquire(&StrategyControl->buffer_strategy_lock);
+ SpinLockAcquire(&freelist->freelist_lock);
- if (StrategyControl->firstFreeBuffer < 0)
+ if (freelist->firstFreeBuffer < 0)
{
- SpinLockRelease(&StrategyControl->buffer_strategy_lock);
+ SpinLockRelease(&freelist->freelist_lock);
break;
}
- buf = GetBufferDescriptor(StrategyControl->firstFreeBuffer);
+ buf = GetBufferDescriptor(freelist->firstFreeBuffer);
Assert(buf->freeNext != FREENEXT_NOT_IN_LIST);
/* Unconditionally remove buffer from freelist */
- StrategyControl->firstFreeBuffer = buf->freeNext;
+ freelist->firstFreeBuffer = buf->freeNext;
buf->freeNext = FREENEXT_NOT_IN_LIST;
+ /* increment number of buffers we consumed from this list */
+ freelist->consumed++;
+
/*
* Release the lock so someone else can access the freelist while
* we check out this buffer.
*/
- SpinLockRelease(&StrategyControl->buffer_strategy_lock);
+ SpinLockRelease(&freelist->freelist_lock);
/*
* If the buffer is pinned or has a nonzero usage_count, we cannot
@@ -305,7 +436,17 @@ StrategyGetBuffer(BufferAccessStrategy strategy, uint32 *buf_state, bool *from_r
}
}
- /* Nothing on the freelist, so run the "clock sweep" algorithm */
+ /*
+ * Nothing on the freelist, so run the "clock sweep" algorithm
+ *
+ * XXX Should we also make this NUMA-aware, to only access buffers from
+ * the same NUMA node? That'd probably mean we need to make the clock
+ * sweep NUMA-aware, perhaps by having multiple clock sweeps, each for a
+ * subset of buffers. But that also means each process could "sweep" only
+ * a fraction of buffers, even if the other buffers are better candidates
+ * for eviction. Would that also mean we'd have multiple bgwriters, one
+ * for each node, or would one bgwriter handle all of that?
+ */
trycounter = NBuffers;
for (;;)
{
@@ -352,11 +493,22 @@ StrategyGetBuffer(BufferAccessStrategy strategy, uint32 *buf_state, bool *from_r
/*
* StrategyFreeBuffer: put a buffer on the freelist
+ *
+ * XXX This calls ChooseFreeList() again, and it might return the freelist to
+ * a different freelist than it was taken from (either by a different backend,
+ * or perhaps even the same backend running on a different CPU). Is that good?
+ * Maybe we should try to balance this somehow, e.g. by choosing a random list,
+ * the shortest one, or something like that? But that breaks the whole idea of
+ * having freelists with buffers from a particular NUMA node.
*/
void
StrategyFreeBuffer(BufferDesc *buf)
{
- SpinLockAcquire(&StrategyControl->buffer_strategy_lock);
+ BufferStrategyFreelist *freelist;
+
+ freelist = ChooseFreeList();
+
+ SpinLockAcquire(&freelist->freelist_lock);
/*
* It is possible that we are told to put something in the freelist that
@@ -364,11 +516,11 @@ StrategyFreeBuffer(BufferDesc *buf)
*/
if (buf->freeNext == FREENEXT_NOT_IN_LIST)
{
- buf->freeNext = StrategyControl->firstFreeBuffer;
- StrategyControl->firstFreeBuffer = buf->buf_id;
+ buf->freeNext = freelist->firstFreeBuffer;
+ freelist->firstFreeBuffer = buf->buf_id;
}
- SpinLockRelease(&StrategyControl->buffer_strategy_lock);
+ SpinLockRelease(&freelist->freelist_lock);
}
/*
@@ -432,6 +584,42 @@ StrategyNotifyBgWriter(int bgwprocno)
SpinLockRelease(&StrategyControl->buffer_strategy_lock);
}
+/* prints some debug info / stats about freelists at shutdown */
+static void
+freelist_before_shmem_exit(int code, Datum arg)
+{
+ for (int node = 0; node < strategy_ncpus; node++)
+ {
+ BufferStrategyFreelist *freelist = &StrategyControl->freelists[node];
+ uint64 remain = 0;
+ uint64 actually_free = 0;
+ int cur = freelist->firstFreeBuffer;
+
+ while (cur >= 0)
+ {
+ uint32 local_buf_state;
+ BufferDesc *buf;
+
+ buf = GetBufferDescriptor(cur);
+
+ remain++;
+
+ local_buf_state = LockBufHdr(buf);
+
+ if (!(local_buf_state & BM_TAG_VALID))
+ actually_free++;
+
+ UnlockBufHdr(buf, local_buf_state);
+
+ cur = buf->freeNext;
+ }
+ elog(LOG, "freelist %d, firstF: %d: consumed: %lu, remain: %lu, actually free: %lu",
+ node,
+ freelist->firstFreeBuffer,
+ freelist->consumed,
+ remain, actually_free);
+ }
+}
/*
* StrategyShmemSize
@@ -446,11 +634,33 @@ StrategyShmemSize(void)
{
Size size = 0;
+ /* FIXME */
+#ifdef USE_LIBNUMA
+ strategy_ncpus = numa_num_task_cpus();
+ strategy_nnodes = numa_num_task_nodes();
+#else
+ strategy_ncpus = 1;
+ strategy_nnodes = 1;
+#endif
+
+ Assert(strategy_nnodes <= strategy_ncpus);
+
/* size of lookup hash table ... see comment in StrategyInitialize */
size = add_size(size, BufTableShmemSize(NBuffers + NUM_BUFFER_PARTITIONS));
/* size of the shared replacement strategy control block */
- size = add_size(size, MAXALIGN(sizeof(BufferStrategyControl)));
+ size = add_size(size, MAXALIGN(offsetof(BufferStrategyControl, freelists)));
+
+ /*
+ * Allocate one freelist per CPU. We might use per-node freelists, but the
+ * assumption is the number of NUMA nodes is less than the number of CPUs.
+ *
+ * FIXME This assumes that we have more CPUs than NUMA nodes, which seems
+ * like a safe assumption. But maybe we should calculate how many elements
+ * we actually need, depending on the GUC? Not a huge amount of memory.
+ */
+ size = add_size(size, MAXALIGN(mul_size(sizeof(BufferStrategyFreelist),
+ strategy_ncpus)));
return size;
}
@@ -466,6 +676,7 @@ void
StrategyInitialize(bool init)
{
bool found;
+ int buffers_per_cpu;
/*
* Initialize the shared buffer lookup hashtable.
@@ -484,23 +695,27 @@ StrategyInitialize(bool init)
*/
StrategyControl = (BufferStrategyControl *)
ShmemInitStruct("Buffer Strategy Status",
- sizeof(BufferStrategyControl),
+ offsetof(BufferStrategyControl, freelists) +
+ (sizeof(BufferStrategyFreelist) * strategy_ncpus),
&found);
if (!found)
{
+ /*
+ * XXX Calling get_nprocs() may not be quite correct, because some of
+ * the processors may get disabled, etc.
+ */
+ int num_cpus = get_nprocs();
+
/*
* Only done once, usually in postmaster
*/
Assert(init);
- SpinLockInit(&StrategyControl->buffer_strategy_lock);
+ /* register callback to dump some stats on exit */
+ before_shmem_exit(freelist_before_shmem_exit, 0);
- /*
- * Grab the whole linked list of free buffers for our strategy. We
- * assume it was previously set up by BufferManagerShmemInit().
- */
- StrategyControl->firstFreeBuffer = 0;
+ SpinLockInit(&StrategyControl->buffer_strategy_lock);
/* Initialize the clock sweep pointer */
pg_atomic_init_u32(&StrategyControl->nextVictimBuffer, 0);
@@ -511,6 +726,61 @@ StrategyInitialize(bool init)
/* No pending notification */
StrategyControl->bgwprocno = -1;
+
+ /*
+ * Rebuild the freelist - right now all buffers are in one huge list,
+ * we want to rework that into multiple lists. Start by initializing
+ * the strategy to have empty lists.
+ */
+ for (int nfreelist = 0; nfreelist < strategy_ncpus; nfreelist++)
+ {
+ BufferStrategyFreelist *freelist;
+
+ freelist = &StrategyControl->freelists[nfreelist];
+
+ freelist->firstFreeBuffer = FREENEXT_END_OF_LIST;
+
+ SpinLockInit(&freelist->freelist_lock);
+ }
+
+ /* buffers per CPU (also used for PID partitioning) */
+ buffers_per_cpu = (NBuffers / strategy_ncpus);
+
+ elog(LOG, "NBuffers: %d, nodes %d, ncpus: %d, divide: %d, remain: %d",
+ NBuffers, strategy_nnodes, strategy_ncpus,
+ buffers_per_cpu, NBuffers - (strategy_ncpus * buffers_per_cpu));
+
+ /*
+ * Walk through the buffers, add them to the correct list. Walk from
+ * the end, because we're adding the buffers to the beginning.
+ */
+ for (int i = NBuffers - 1; i >= 0; i--)
+ {
+ BufferDesc *buf = GetBufferDescriptor(i);
+ BufferStrategyFreelist *freelist;
+ int belongs_to = 0; /* first freelist by default */
+
+ /*
+ * Split the freelist into partitions, if needed (or just keep the
+ * freelist we already built in BufferManagerShmemInit()).
+ */
+ if ((numa_partition_freelist == FREELIST_PARTITION_CPU) ||
+ (numa_partition_freelist == FREELIST_PARTITION_PID))
+ {
+ belongs_to = (i % num_cpus);
+ }
+ else if (numa_partition_freelist == FREELIST_PARTITION_NODE)
+ {
+ /* determine NUMA node for buffer */
+ belongs_to = BufferGetNode(i);
+ }
+
+ /* add to the right freelist */
+ freelist = &StrategyControl->freelists[belongs_to];
+
+ buf->freeNext = freelist->firstFreeBuffer;
+ freelist->firstFreeBuffer = i;
+ }
}
else
Assert(!init);
diff --git a/src/backend/utils/init/globals.c b/src/backend/utils/init/globals.c
index f5359db3656..7febf3001a3 100644
--- a/src/backend/utils/init/globals.c
+++ b/src/backend/utils/init/globals.c
@@ -148,6 +148,7 @@ int MaxBackends = 0;
/* NUMA stuff */
bool numa_buffers_interleave = false;
bool numa_localalloc = false;
+int numa_partition_freelist = 0;
/* GUC parameters for vacuum */
int VacuumBufferUsageLimit = 2048;
diff --git a/src/backend/utils/misc/guc_tables.c b/src/backend/utils/misc/guc_tables.c
index 57f2df7ab74..e2361c161e6 100644
--- a/src/backend/utils/misc/guc_tables.c
+++ b/src/backend/utils/misc/guc_tables.c
@@ -491,6 +491,14 @@ static const struct config_enum_entry file_copy_method_options[] = {
{NULL, 0, false}
};
+static const struct config_enum_entry freelist_partition_options[] = {
+ {"none", FREELIST_PARTITION_NONE, false},
+ {"node", FREELIST_PARTITION_NODE, false},
+ {"cpu", FREELIST_PARTITION_CPU, false},
+ {"pid", FREELIST_PARTITION_PID, false},
+ {NULL, 0, false}
+};
+
/*
* Options for enum values stored in other modules
*/
@@ -5284,6 +5292,16 @@ struct config_enum ConfigureNamesEnum[] =
NULL, NULL, NULL
},
+ {
+ {"numa_partition_freelist", PGC_USERSET, DEVELOPER_OPTIONS,
+ gettext_noop("Enables buffer freelists to be partitioned per NUMA node."),
+ gettext_noop("When enabled, we create a separate freelist per NUMA node."),
+ },
+ &numa_partition_freelist,
+ FREELIST_PARTITION_NONE, freelist_partition_options,
+ NULL, NULL, NULL
+ },
+
{
{"wal_sync_method", PGC_SIGHUP, WAL_SETTINGS,
gettext_noop("Selects the method used for forcing WAL updates to disk."),
diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h
index 692871a401f..17528439f07 100644
--- a/src/include/miscadmin.h
+++ b/src/include/miscadmin.h
@@ -180,6 +180,7 @@ extern PGDLLIMPORT int max_parallel_workers;
extern PGDLLIMPORT bool numa_buffers_interleave;
extern PGDLLIMPORT bool numa_localalloc;
+extern PGDLLIMPORT int numa_partition_freelist;
extern PGDLLIMPORT int commit_timestamp_buffers;
extern PGDLLIMPORT int multixact_member_buffers;
diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index c257c8a1c20..efb7e28c10f 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -93,6 +93,14 @@ typedef enum ExtendBufferedFlags
EB_LOCK_TARGET = (1 << 5),
} ExtendBufferedFlags;
+typedef enum FreelistPartitionMode
+{
+ FREELIST_PARTITION_NONE,
+ FREELIST_PARTITION_NODE,
+ FREELIST_PARTITION_CPU,
+ FREELIST_PARTITION_PID,
+} FreelistPartitionMode;
+
/*
* Some functions identify relations either by relation or smgr +
* relpersistence. Used via the BMR_REL()/BMR_SMGR() macros below. This
--
2.49.0
v1-0005-NUMA-interleave-PGPROC-entries.patch
From 05c594ed8eb8a266a74038c3131d12bb03d897e3 Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tomas@vondra.me>
Date: Thu, 22 May 2025 18:39:08 +0200
Subject: [PATCH v1 5/6] NUMA: interleave PGPROC entries
The goal is to distribute ProcArray (or rather PGPROC entries and
associated fast-path arrays) to NUMA nodes.
We can't do this by simply interleaving pages, because that wouldn't
work for both parts at the same time. We want to place the PGPROC and
it's fast-path locking structs on the same node, but the structs are
of different sizes, etc.
Another problem is that PGPROC entries are fairly small, so with huge
pages and reasonable values of max_connections everything fits onto a
single page. We don't want to make this incompatible with huge pages.
Note: If we eventually switch to allocating separate shared segments for
different parts (to allow on-line resizing), we could keep using regular
pages for procarray, and this would not be such an issue.
To make this work, we split the PGPROC array into per-node segments,
each with about (MaxBackends / numa_nodes) entries, and one segment for
auxiliary processes and prepared transactions. And we do the same thing
for fast-path arrays.
The PGPROC segments are laid out like this (e.g. for 2 NUMA nodes):
- PGPROC array / node #0
- PGPROC array / node #1
- PGPROC array / aux processes + 2PC transactions
- fast-path arrays / node #0
- fast-path arrays / node #1
- fast-path arrays / aux processes + 2PC transaction
Each segment is aligned to (starts at) a memory page boundary, and its size
is effectively a multiple of the memory page size.
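As a rough illustration of the layout arithmetic (an illustrative sketch only,
assuming MaxBackends, the node count and the memory page size are known; it
mirrors what InitProcGlobal() does in the diff below):

    /* ceiling division: PGPROC entries per NUMA node */
    int   procs_per_node = (MaxBackends + (numa_nodes - 1)) / numa_nodes;
    char *ptr = base;           /* start of the shared allocation */

    for (int node = 0; node < numa_nodes; node++)
    {
        /* the last node may get fewer entries */
        int   count = Min(procs_per_node, MaxBackends - node * procs_per_node);

        /* each per-node chunk starts on a memory-page boundary */
        ptr = (char *) TYPEALIGN(mem_page_size, ptr);

        /* PGPROC entries for this node live in [ptr, ptr + count * sizeof(PGPROC)) */
        ptr += count * sizeof(PGPROC);
    }

    /* one more page-aligned chunk for aux processes + prepared xacts */
    ptr = (char *) TYPEALIGN(mem_page_size, ptr);
    ptr += (NUM_AUXILIARY_PROCS + max_prepared_xacts) * sizeof(PGPROC);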
Having a single PGPROC array made certain operations easier - e.g. it
was possible to iterate the array, and GetNumberFromPGProc() could
calculate offset by simply subtracting PGPROC pointers. With multiple
segments that's not possible, but the fallout is minimal.
Most places accessed PGPROC through PROC_HDR->allProcs, and can continue
to do so, except that now they get a pointer to the PGPROC (which most
places wanted anyway).
Note: There's an extra indirection now, but the pointer does not change,
so hopefully that's not an issue. Each PGPROC entry also gets an explicit
procnumber field with its index in allProcs, so GetNumberFromPGProc can
simply return that.
Each PGPROC also gets numa_node, tracking the NUMA node, so that we
don't have to recalculate that. This is used by InitProcess() to pick
a PGPROC entry from the local NUMA node.
Note: The scheduler may migrate the process to a different CPU/node
later. Maybe we should consider pinning the process to the node?
---
src/backend/access/transam/clog.c | 4 +-
src/backend/postmaster/pgarch.c | 2 +-
src/backend/postmaster/walsummarizer.c | 2 +-
src/backend/storage/buffer/freelist.c | 2 +-
src/backend/storage/ipc/procarray.c | 62 ++---
src/backend/storage/lmgr/lock.c | 6 +-
src/backend/storage/lmgr/proc.c | 368 +++++++++++++++++++++++--
src/backend/utils/init/globals.c | 1 +
src/backend/utils/misc/guc_tables.c | 10 +
src/include/miscadmin.h | 1 +
src/include/storage/proc.h | 11 +-
11 files changed, 407 insertions(+), 62 deletions(-)
diff --git a/src/backend/access/transam/clog.c b/src/backend/access/transam/clog.c
index 48f10bec91e..90ddff37bc6 100644
--- a/src/backend/access/transam/clog.c
+++ b/src/backend/access/transam/clog.c
@@ -576,7 +576,7 @@ TransactionGroupUpdateXidStatus(TransactionId xid, XidStatus status,
/* Walk the list and update the status of all XIDs. */
while (nextidx != INVALID_PROC_NUMBER)
{
- PGPROC *nextproc = &ProcGlobal->allProcs[nextidx];
+ PGPROC *nextproc = ProcGlobal->allProcs[nextidx];
int64 thispageno = nextproc->clogGroupMemberPage;
/*
@@ -635,7 +635,7 @@ TransactionGroupUpdateXidStatus(TransactionId xid, XidStatus status,
*/
while (wakeidx != INVALID_PROC_NUMBER)
{
- PGPROC *wakeproc = &ProcGlobal->allProcs[wakeidx];
+ PGPROC *wakeproc = ProcGlobal->allProcs[wakeidx];
wakeidx = pg_atomic_read_u32(&wakeproc->clogGroupNext);
pg_atomic_write_u32(&wakeproc->clogGroupNext, INVALID_PROC_NUMBER);
diff --git a/src/backend/postmaster/pgarch.c b/src/backend/postmaster/pgarch.c
index 7e622ae4bd2..75c0e4bf53c 100644
--- a/src/backend/postmaster/pgarch.c
+++ b/src/backend/postmaster/pgarch.c
@@ -289,7 +289,7 @@ PgArchWakeup(void)
* be relaunched shortly and will start archiving.
*/
if (arch_pgprocno != INVALID_PROC_NUMBER)
- SetLatch(&ProcGlobal->allProcs[arch_pgprocno].procLatch);
+ SetLatch(&ProcGlobal->allProcs[arch_pgprocno]->procLatch);
}
diff --git a/src/backend/postmaster/walsummarizer.c b/src/backend/postmaster/walsummarizer.c
index 0fec4f1f871..0044ef54363 100644
--- a/src/backend/postmaster/walsummarizer.c
+++ b/src/backend/postmaster/walsummarizer.c
@@ -649,7 +649,7 @@ WakeupWalSummarizer(void)
LWLockRelease(WALSummarizerLock);
if (pgprocno != INVALID_PROC_NUMBER)
- SetLatch(&ProcGlobal->allProcs[pgprocno].procLatch);
+ SetLatch(&ProcGlobal->allProcs[pgprocno]->procLatch);
}
/*
diff --git a/src/backend/storage/buffer/freelist.c b/src/backend/storage/buffer/freelist.c
index c93ec2841c5..4e390a77a71 100644
--- a/src/backend/storage/buffer/freelist.c
+++ b/src/backend/storage/buffer/freelist.c
@@ -360,7 +360,7 @@ StrategyGetBuffer(BufferAccessStrategy strategy, uint32 *buf_state, bool *from_r
* actually fine because procLatch isn't ever freed, so we just can
* potentially set the wrong process' (or no process') latch.
*/
- SetLatch(&ProcGlobal->allProcs[bgwprocno].procLatch);
+ SetLatch(&ProcGlobal->allProcs[bgwprocno]->procLatch);
}
/*
diff --git a/src/backend/storage/ipc/procarray.c b/src/backend/storage/ipc/procarray.c
index e5b945a9ee3..3277480fbcf 100644
--- a/src/backend/storage/ipc/procarray.c
+++ b/src/backend/storage/ipc/procarray.c
@@ -268,7 +268,7 @@ typedef enum KAXCompressReason
static ProcArrayStruct *procArray;
-static PGPROC *allProcs;
+static PGPROC **allProcs;
/*
* Cache to reduce overhead of repeated calls to TransactionIdIsInProgress()
@@ -502,7 +502,7 @@ ProcArrayAdd(PGPROC *proc)
int this_procno = arrayP->pgprocnos[index];
Assert(this_procno >= 0 && this_procno < (arrayP->maxProcs + NUM_AUXILIARY_PROCS));
- Assert(allProcs[this_procno].pgxactoff == index);
+ Assert(allProcs[this_procno]->pgxactoff == index);
/* If we have found our right position in the array, break */
if (this_procno > pgprocno)
@@ -538,9 +538,9 @@ ProcArrayAdd(PGPROC *proc)
int procno = arrayP->pgprocnos[index];
Assert(procno >= 0 && procno < (arrayP->maxProcs + NUM_AUXILIARY_PROCS));
- Assert(allProcs[procno].pgxactoff == index - 1);
+ Assert(allProcs[procno]->pgxactoff == index - 1);
- allProcs[procno].pgxactoff = index;
+ allProcs[procno]->pgxactoff = index;
}
/*
@@ -581,7 +581,7 @@ ProcArrayRemove(PGPROC *proc, TransactionId latestXid)
myoff = proc->pgxactoff;
Assert(myoff >= 0 && myoff < arrayP->numProcs);
- Assert(ProcGlobal->allProcs[arrayP->pgprocnos[myoff]].pgxactoff == myoff);
+ Assert(ProcGlobal->allProcs[arrayP->pgprocnos[myoff]]->pgxactoff == myoff);
if (TransactionIdIsValid(latestXid))
{
@@ -636,9 +636,9 @@ ProcArrayRemove(PGPROC *proc, TransactionId latestXid)
int procno = arrayP->pgprocnos[index];
Assert(procno >= 0 && procno < (arrayP->maxProcs + NUM_AUXILIARY_PROCS));
- Assert(allProcs[procno].pgxactoff - 1 == index);
+ Assert(allProcs[procno]->pgxactoff - 1 == index);
- allProcs[procno].pgxactoff = index;
+ allProcs[procno]->pgxactoff = index;
}
/*
@@ -860,7 +860,7 @@ ProcArrayGroupClearXid(PGPROC *proc, TransactionId latestXid)
/* Walk the list and clear all XIDs. */
while (nextidx != INVALID_PROC_NUMBER)
{
- PGPROC *nextproc = &allProcs[nextidx];
+ PGPROC *nextproc = allProcs[nextidx];
ProcArrayEndTransactionInternal(nextproc, nextproc->procArrayGroupMemberXid);
@@ -880,7 +880,7 @@ ProcArrayGroupClearXid(PGPROC *proc, TransactionId latestXid)
*/
while (wakeidx != INVALID_PROC_NUMBER)
{
- PGPROC *nextproc = &allProcs[wakeidx];
+ PGPROC *nextproc = allProcs[wakeidx];
wakeidx = pg_atomic_read_u32(&nextproc->procArrayGroupNext);
pg_atomic_write_u32(&nextproc->procArrayGroupNext, INVALID_PROC_NUMBER);
@@ -1526,7 +1526,7 @@ TransactionIdIsInProgress(TransactionId xid)
pxids = other_subxidstates[pgxactoff].count;
pg_read_barrier(); /* pairs with barrier in GetNewTransactionId() */
pgprocno = arrayP->pgprocnos[pgxactoff];
- proc = &allProcs[pgprocno];
+ proc = allProcs[pgprocno];
for (j = pxids - 1; j >= 0; j--)
{
/* Fetch xid just once - see GetNewTransactionId */
@@ -1650,7 +1650,7 @@ TransactionIdIsActive(TransactionId xid)
for (i = 0; i < arrayP->numProcs; i++)
{
int pgprocno = arrayP->pgprocnos[i];
- PGPROC *proc = &allProcs[pgprocno];
+ PGPROC *proc = allProcs[pgprocno];
TransactionId pxid;
/* Fetch xid just once - see GetNewTransactionId */
@@ -1792,7 +1792,7 @@ ComputeXidHorizons(ComputeXidHorizonsResult *h)
for (int index = 0; index < arrayP->numProcs; index++)
{
int pgprocno = arrayP->pgprocnos[index];
- PGPROC *proc = &allProcs[pgprocno];
+ PGPROC *proc = allProcs[pgprocno];
int8 statusFlags = ProcGlobal->statusFlags[index];
TransactionId xid;
TransactionId xmin;
@@ -2276,7 +2276,7 @@ GetSnapshotData(Snapshot snapshot)
TransactionId xid = UINT32_ACCESS_ONCE(other_xids[pgxactoff]);
uint8 statusFlags;
- Assert(allProcs[arrayP->pgprocnos[pgxactoff]].pgxactoff == pgxactoff);
+ Assert(allProcs[arrayP->pgprocnos[pgxactoff]]->pgxactoff == pgxactoff);
/*
* If the transaction has no XID assigned, we can skip it; it
@@ -2350,7 +2350,7 @@ GetSnapshotData(Snapshot snapshot)
if (nsubxids > 0)
{
int pgprocno = pgprocnos[pgxactoff];
- PGPROC *proc = &allProcs[pgprocno];
+ PGPROC *proc = allProcs[pgprocno];
pg_read_barrier(); /* pairs with GetNewTransactionId */
@@ -2551,7 +2551,7 @@ ProcArrayInstallImportedXmin(TransactionId xmin,
for (index = 0; index < arrayP->numProcs; index++)
{
int pgprocno = arrayP->pgprocnos[index];
- PGPROC *proc = &allProcs[pgprocno];
+ PGPROC *proc = allProcs[pgprocno];
int statusFlags = ProcGlobal->statusFlags[index];
TransactionId xid;
@@ -2777,7 +2777,7 @@ GetRunningTransactionData(void)
if (TransactionIdPrecedes(xid, oldestDatabaseRunningXid))
{
int pgprocno = arrayP->pgprocnos[index];
- PGPROC *proc = &allProcs[pgprocno];
+ PGPROC *proc = allProcs[pgprocno];
if (proc->databaseId == MyDatabaseId)
oldestDatabaseRunningXid = xid;
@@ -2808,7 +2808,7 @@ GetRunningTransactionData(void)
for (index = 0; index < arrayP->numProcs; index++)
{
int pgprocno = arrayP->pgprocnos[index];
- PGPROC *proc = &allProcs[pgprocno];
+ PGPROC *proc = allProcs[pgprocno];
int nsubxids;
/*
@@ -3058,7 +3058,7 @@ GetVirtualXIDsDelayingChkpt(int *nvxids, int type)
for (index = 0; index < arrayP->numProcs; index++)
{
int pgprocno = arrayP->pgprocnos[index];
- PGPROC *proc = &allProcs[pgprocno];
+ PGPROC *proc = allProcs[pgprocno];
if ((proc->delayChkptFlags & type) != 0)
{
@@ -3099,7 +3099,7 @@ HaveVirtualXIDsDelayingChkpt(VirtualTransactionId *vxids, int nvxids, int type)
for (index = 0; index < arrayP->numProcs; index++)
{
int pgprocno = arrayP->pgprocnos[index];
- PGPROC *proc = &allProcs[pgprocno];
+ PGPROC *proc = allProcs[pgprocno];
VirtualTransactionId vxid;
GET_VXID_FROM_PGPROC(vxid, *proc);
@@ -3227,7 +3227,7 @@ BackendPidGetProcWithLock(int pid)
for (index = 0; index < arrayP->numProcs; index++)
{
- PGPROC *proc = &allProcs[arrayP->pgprocnos[index]];
+ PGPROC *proc = allProcs[arrayP->pgprocnos[index]];
if (proc->pid == pid)
{
@@ -3270,7 +3270,7 @@ BackendXidGetPid(TransactionId xid)
if (other_xids[index] == xid)
{
int pgprocno = arrayP->pgprocnos[index];
- PGPROC *proc = &allProcs[pgprocno];
+ PGPROC *proc = allProcs[pgprocno];
result = proc->pid;
break;
@@ -3339,7 +3339,7 @@ GetCurrentVirtualXIDs(TransactionId limitXmin, bool excludeXmin0,
for (index = 0; index < arrayP->numProcs; index++)
{
int pgprocno = arrayP->pgprocnos[index];
- PGPROC *proc = &allProcs[pgprocno];
+ PGPROC *proc = allProcs[pgprocno];
uint8 statusFlags = ProcGlobal->statusFlags[index];
if (proc == MyProc)
@@ -3441,7 +3441,7 @@ GetConflictingVirtualXIDs(TransactionId limitXmin, Oid dbOid)
for (index = 0; index < arrayP->numProcs; index++)
{
int pgprocno = arrayP->pgprocnos[index];
- PGPROC *proc = &allProcs[pgprocno];
+ PGPROC *proc = allProcs[pgprocno];
/* Exclude prepared transactions */
if (proc->pid == 0)
@@ -3506,7 +3506,7 @@ SignalVirtualTransaction(VirtualTransactionId vxid, ProcSignalReason sigmode,
for (index = 0; index < arrayP->numProcs; index++)
{
int pgprocno = arrayP->pgprocnos[index];
- PGPROC *proc = &allProcs[pgprocno];
+ PGPROC *proc = allProcs[pgprocno];
VirtualTransactionId procvxid;
GET_VXID_FROM_PGPROC(procvxid, *proc);
@@ -3561,7 +3561,7 @@ MinimumActiveBackends(int min)
for (index = 0; index < arrayP->numProcs; index++)
{
int pgprocno = arrayP->pgprocnos[index];
- PGPROC *proc = &allProcs[pgprocno];
+ PGPROC *proc = allProcs[pgprocno];
/*
* Since we're not holding a lock, need to be prepared to deal with
@@ -3607,7 +3607,7 @@ CountDBBackends(Oid databaseid)
for (index = 0; index < arrayP->numProcs; index++)
{
int pgprocno = arrayP->pgprocnos[index];
- PGPROC *proc = &allProcs[pgprocno];
+ PGPROC *proc = allProcs[pgprocno];
if (proc->pid == 0)
continue; /* do not count prepared xacts */
@@ -3636,7 +3636,7 @@ CountDBConnections(Oid databaseid)
for (index = 0; index < arrayP->numProcs; index++)
{
int pgprocno = arrayP->pgprocnos[index];
- PGPROC *proc = &allProcs[pgprocno];
+ PGPROC *proc = allProcs[pgprocno];
if (proc->pid == 0)
continue; /* do not count prepared xacts */
@@ -3667,7 +3667,7 @@ CancelDBBackends(Oid databaseid, ProcSignalReason sigmode, bool conflictPending)
for (index = 0; index < arrayP->numProcs; index++)
{
int pgprocno = arrayP->pgprocnos[index];
- PGPROC *proc = &allProcs[pgprocno];
+ PGPROC *proc = allProcs[pgprocno];
if (databaseid == InvalidOid || proc->databaseId == databaseid)
{
@@ -3708,7 +3708,7 @@ CountUserBackends(Oid roleid)
for (index = 0; index < arrayP->numProcs; index++)
{
int pgprocno = arrayP->pgprocnos[index];
- PGPROC *proc = &allProcs[pgprocno];
+ PGPROC *proc = allProcs[pgprocno];
if (proc->pid == 0)
continue; /* do not count prepared xacts */
@@ -3771,7 +3771,7 @@ CountOtherDBBackends(Oid databaseId, int *nbackends, int *nprepared)
for (index = 0; index < arrayP->numProcs; index++)
{
int pgprocno = arrayP->pgprocnos[index];
- PGPROC *proc = &allProcs[pgprocno];
+ PGPROC *proc = allProcs[pgprocno];
uint8 statusFlags = ProcGlobal->statusFlags[index];
if (proc->databaseId != databaseId)
@@ -3837,7 +3837,7 @@ TerminateOtherDBBackends(Oid databaseId)
for (i = 0; i < procArray->numProcs; i++)
{
int pgprocno = arrayP->pgprocnos[i];
- PGPROC *proc = &allProcs[pgprocno];
+ PGPROC *proc = allProcs[pgprocno];
if (proc->databaseId != databaseId)
continue;
diff --git a/src/backend/storage/lmgr/lock.c b/src/backend/storage/lmgr/lock.c
index 2776ceb295b..95b1da42408 100644
--- a/src/backend/storage/lmgr/lock.c
+++ b/src/backend/storage/lmgr/lock.c
@@ -2844,7 +2844,7 @@ FastPathTransferRelationLocks(LockMethod lockMethodTable, const LOCKTAG *locktag
*/
for (i = 0; i < ProcGlobal->allProcCount; i++)
{
- PGPROC *proc = &ProcGlobal->allProcs[i];
+ PGPROC *proc = ProcGlobal->allProcs[i];
uint32 j;
LWLockAcquire(&proc->fpInfoLock, LW_EXCLUSIVE);
@@ -3103,7 +3103,7 @@ GetLockConflicts(const LOCKTAG *locktag, LOCKMODE lockmode, int *countp)
*/
for (i = 0; i < ProcGlobal->allProcCount; i++)
{
- PGPROC *proc = &ProcGlobal->allProcs[i];
+ PGPROC *proc = ProcGlobal->allProcs[i];
uint32 j;
/* A backend never blocks itself */
@@ -3790,7 +3790,7 @@ GetLockStatusData(void)
*/
for (i = 0; i < ProcGlobal->allProcCount; ++i)
{
- PGPROC *proc = &ProcGlobal->allProcs[i];
+ PGPROC *proc = ProcGlobal->allProcs[i];
/* Skip backends with pid=0, as they don't hold fast-path locks */
if (proc->pid == 0)
diff --git a/src/backend/storage/lmgr/proc.c b/src/backend/storage/lmgr/proc.c
index e9ef0fbfe32..9d3e94a7b3a 100644
--- a/src/backend/storage/lmgr/proc.c
+++ b/src/backend/storage/lmgr/proc.c
@@ -29,21 +29,29 @@
*/
#include "postgres.h"
+#include <sched.h>
#include <signal.h>
#include <unistd.h>
#include <sys/time.h>
+#ifdef USE_LIBNUMA
+#include <numa.h>
+#include <numaif.h>
+#endif
+
#include "access/transam.h"
#include "access/twophase.h"
#include "access/xlogutils.h"
#include "miscadmin.h"
#include "pgstat.h"
+#include "port/pg_numa.h"
#include "postmaster/autovacuum.h"
#include "replication/slotsync.h"
#include "replication/syncrep.h"
#include "storage/condition_variable.h"
#include "storage/ipc.h"
#include "storage/lmgr.h"
+#include "storage/pg_shmem.h"
#include "storage/pmsignal.h"
#include "storage/proc.h"
#include "storage/procarray.h"
@@ -89,6 +97,12 @@ static void ProcKill(int code, Datum arg);
static void AuxiliaryProcKill(int code, Datum arg);
static void CheckDeadLock(void);
+/* NUMA */
+static Size get_memory_page_size(void); /* XXX duplicate */
+static void move_to_node(char *startptr, char *endptr,
+ Size mem_page_size, int node);
+static int numa_nodes = -1;
+
/*
* Report shared-memory space needed by PGPROC.
@@ -100,11 +114,40 @@ PGProcShmemSize(void)
Size TotalProcs =
add_size(MaxBackends, add_size(NUM_AUXILIARY_PROCS, max_prepared_xacts));
+ size = add_size(size, mul_size(TotalProcs, sizeof(PGPROC *)));
size = add_size(size, mul_size(TotalProcs, sizeof(PGPROC)));
size = add_size(size, mul_size(TotalProcs, sizeof(*ProcGlobal->xids)));
size = add_size(size, mul_size(TotalProcs, sizeof(*ProcGlobal->subxidStates)));
size = add_size(size, mul_size(TotalProcs, sizeof(*ProcGlobal->statusFlags)));
+ /*
+ * With NUMA, we allocate the PGPROC array in several chunks. With shared
+ * buffers we simply manually assign parts of the buffer array to
+ * different NUMA nodes, and that does the trick. But we can't do that for
+ * PGPROC, as the number of PGPROC entries is much lower, especially with
+ * huge pages. We can fit ~2k entries on a 2MB page, and NUMA does stuff
+ * with page granularity, and the large NUMA systems are likely to use
+ * huge pages. So with sensible max_connections we would not use more than
+ * a single page, which means it gets to a single NUMA node.
+ *
+ * So we allocate PGPROC not as a single array, but one array per NUMA
+ * node, and then one array for aux processes (without NUMA node
+ * assigned). Each array may need up to memory-page-worth of padding,
+ * worst case. So we just add that - it's a bit wasteful, but good enough
+ * for PoC.
+ *
+ * FIXME Should be conditional, but that was causing problems in bootstrap
+ * mode. Or maybe it was because the code that allocates stuff later does
+ * not do that conditionally. Anyway, needs to be fixed.
+ */
+ /* if (numa_procs_interleave) */
+ {
+ int num_nodes = numa_num_configured_nodes();
+ Size mem_page_size = get_memory_page_size();
+
+ size = add_size(size, mul_size((num_nodes + 1), mem_page_size));
+ }
+
return size;
}
@@ -129,6 +172,26 @@ FastPathLockShmemSize(void)
size = add_size(size, mul_size(TotalProcs, (fpLockBitsSize + fpRelIdSize)));
+ /*
+ * Same NUMA-padding logic as in PGProcShmemSize, adding a memory page per
+ * NUMA node - but this way we add two pages per node - one for PGPROC,
+ * one for fast-path arrays. In theory we could make this work with just one
+ * page per node, by adding fast-path arrays right after PGPROC entries on
+ * each node. But now we allocate fast-path locks separately - good enough
+ * for PoC.
+ *
+ * FIXME Should be conditional, but that was causing problems in bootstrap
+ * mode. Or maybe it was because the code that allocates stuff later does
+ * not do that conditionally. Anyway, needs to be fixed.
+ */
+ /* if (numa_procs_interleave) */
+ {
+ int num_nodes = numa_num_configured_nodes();
+ Size mem_page_size = get_memory_page_size();
+
+ size = add_size(size, mul_size((num_nodes + 1), mem_page_size));
+ }
+
return size;
}
@@ -191,11 +254,13 @@ ProcGlobalSemas(void)
void
InitProcGlobal(void)
{
- PGPROC *procs;
+ PGPROC **procs;
int i,
j;
bool found;
uint32 TotalProcs = MaxBackends + NUM_AUXILIARY_PROCS + max_prepared_xacts;
+ int procs_total;
+ int procs_per_node;
/* Used for setup of per-backend fast-path slots. */
char *fpPtr,
@@ -205,6 +270,8 @@ InitProcGlobal(void)
Size requestSize;
char *ptr;
+ Size mem_page_size = get_memory_page_size();
+
/* Create the ProcGlobal shared structure */
ProcGlobal = (PROC_HDR *)
ShmemInitStruct("Proc Header", sizeof(PROC_HDR), &found);
@@ -224,6 +291,9 @@ InitProcGlobal(void)
pg_atomic_init_u32(&ProcGlobal->procArrayGroupFirst, INVALID_PROC_NUMBER);
pg_atomic_init_u32(&ProcGlobal->clogGroupFirst, INVALID_PROC_NUMBER);
+ /* one chunk per NUMA node (without NUMA assume 1 node) */
+ numa_nodes = numa_num_configured_nodes();
+
/*
* Create and initialize all the PGPROC structures we'll need. There are
* six separate consumers: (1) normal backends, (2) autovacuum workers and
@@ -241,19 +311,108 @@ InitProcGlobal(void)
MemSet(ptr, 0, requestSize);
- procs = (PGPROC *) ptr;
- ptr = (char *) ptr + TotalProcs * sizeof(PGPROC);
+ /* allprocs (array of pointers to PGPROC entries) */
+ procs = (PGPROC **) ptr;
+ ptr = (char *) ptr + TotalProcs * sizeof(PGPROC *);
ProcGlobal->allProcs = procs;
/* XXX allProcCount isn't really all of them; it excludes prepared xacts */
ProcGlobal->allProcCount = MaxBackends + NUM_AUXILIARY_PROCS;
+ /*
+ * NUMA partitioning
+ *
+ * Now build the actual PGPROC arrays, one "chunk" per NUMA node (and one
+ * extra for auxiliary processes and 2PC transactions, not associated with
+ * any particular node).
+ *
+ * First determine how many "backend" procs to allocate per NUMA node. The
+ * count may not be exactly divisible, but we mostly ignore that. The last
+ * node may get somewhat fewer PGPROC entries, but the imbalance ought to
+ * be pretty small (if MaxBackends >> numa_nodes).
+ *
+ * XXX A fairer distribution is possible, but not worth it now.
+ */
+ procs_per_node = (MaxBackends + (numa_nodes - 1)) / numa_nodes;
+ procs_total = 0;
+
+ /* build PGPROC entries for NUMA nodes */
+ for (i = 0; i < numa_nodes; i++)
+ {
+ PGPROC *procs_node;
+
+ /* the last NUMA node may get fewer PGPROC entries, but meh */
+ int count_node = Min(procs_per_node, MaxBackends - procs_total);
+
+ /* make sure to align the PGPROC array to memory page */
+ ptr = (char *) TYPEALIGN(mem_page_size, ptr);
+
+ /* allocate the PGPROC chunk for this node */
+ procs_node = (PGPROC *) ptr;
+ ptr = (char *) ptr + count_node * sizeof(PGPROC);
+
+ /* don't overflow the allocation */
+ Assert((ptr > (char *) procs) && (ptr <= (char *) procs + requestSize));
+
+ /* add pointers to the PGPROC entries to allProcs */
+ for (j = 0; j < count_node; j++)
+ {
+ procs_node[j].numa_node = i;
+ procs_node[j].procnumber = procs_total;
+
+ ProcGlobal->allProcs[procs_total++] = &procs_node[j];
+ }
+
+ move_to_node((char *) procs_node, ptr, mem_page_size, i);
+ }
+
+ /*
+ * also build PGPROC entries for auxiliary procs / prepared xacts (we
+ * don't assign those to any NUMA node)
+ *
+ * XXX Mostly duplicate of preceding block, could be reused.
+ */
+ {
+ PGPROC *procs_node;
+ int count_node = (NUM_AUXILIARY_PROCS + max_prepared_xacts);
+
+ /*
+ * Make sure to align PGPROC array to memory page (it may not be
+ * aligned). We won't assign this to any NUMA node, but we still don't
+ * want it to interfere with the preceding chunk (for the last NUMA
+ * node).
+ */
+ ptr = (char *) TYPEALIGN(mem_page_size, ptr);
+
+ procs_node = (PGPROC *) ptr;
+ ptr = (char *) ptr + count_node * sizeof(PGPROC);
+
+ /* don't overflow the allocation */
+ Assert((ptr > (char *) procs) && (ptr <= (char *) procs + requestSize));
+
+ /* now add the PGPROC pointers to allProcs */
+ for (j = 0; j < count_node; j++)
+ {
+ procs_node[j].numa_node = -1;
+ procs_node[j].procnumber = procs_total;
+
+ ProcGlobal->allProcs[procs_total++] = &procs_node[j];
+ }
+ }
+
+ /* we should have allocated the expected number of PGPROC entries */
+ Assert(procs_total == TotalProcs);
+
/*
* Allocate arrays mirroring PGPROC fields in a dense manner. See
* PROC_HDR.
*
* XXX: It might make sense to increase padding for these arrays, given
* how hotly they are accessed.
+ *
+ * XXX Would it make sense to NUMA-partition these chunks too, somehow?
+ * But those arrays are tiny and fit into a single memory page, so this
+ * would need to be made more complex. Not sure.
*/
ProcGlobal->xids = (TransactionId *) ptr;
ptr = (char *) ptr + (TotalProcs * sizeof(*ProcGlobal->xids));
@@ -286,23 +445,100 @@ InitProcGlobal(void)
/* For asserts checking we did not overflow. */
fpEndPtr = fpPtr + requestSize;
- for (i = 0; i < TotalProcs; i++)
+ /* reset the count */
+ procs_total = 0;
+
+ /*
+ * Mimic the same logic as above, but for fast-path locking.
+ */
+ for (i = 0; i < numa_nodes; i++)
{
- PGPROC *proc = &procs[i];
+ char *startptr;
+ char *endptr;
- /* Common initialization for all PGPROCs, regardless of type. */
+ /* the last NUMA node may get fewer PGPROC entries, but meh */
+ int procs_node = Min(procs_per_node, MaxBackends - procs_total);
+
+ /* align to memory page, to make move_pages possible */
+ fpPtr = (char *) TYPEALIGN(mem_page_size, fpPtr);
+
+ startptr = fpPtr;
+ endptr = fpPtr + procs_node * (fpLockBitsSize + fpRelIdSize);
+
+ move_to_node(startptr, endptr, mem_page_size, i);
/*
- * Set the fast-path lock arrays, and move the pointer. We interleave
- * the two arrays, to (hopefully) get some locality for each backend.
+ * Now point the PGPROC entries to the fast-path arrays, and also
+ * advance the fpPtr.
*/
- proc->fpLockBits = (uint64 *) fpPtr;
- fpPtr += fpLockBitsSize;
+ for (j = 0; j < procs_node; j++)
+ {
+ PGPROC *proc = ProcGlobal->allProcs[procs_total++];
+
+ /* cross-check we got the expected NUMA node */
+ Assert(proc->numa_node == i);
+ Assert(proc->procnumber == (procs_total - 1));
+
+ /*
+ * Set the fast-path lock arrays, and move the pointer. We
+ * interleave the two arrays, to (hopefully) get some locality for
+ * each backend.
+ */
+ proc->fpLockBits = (uint64 *) fpPtr;
+ fpPtr += fpLockBitsSize;
- proc->fpRelId = (Oid *) fpPtr;
- fpPtr += fpRelIdSize;
+ proc->fpRelId = (Oid *) fpPtr;
+ fpPtr += fpRelIdSize;
- Assert(fpPtr <= fpEndPtr);
+ Assert(fpPtr <= fpEndPtr);
+ }
+
+ Assert(fpPtr == endptr);
+ }
+
+ /* auxiliary processes / prepared xacts */
+ {
+ /* auxiliary processes and prepared xacts share a single extra chunk */
+ int procs_node = (NUM_AUXILIARY_PROCS + max_prepared_xacts);
+
+ /* align to memory page, to make move_pages possible */
+ fpPtr = (char *) TYPEALIGN(mem_page_size, fpPtr);
+
+ /* now point the PGPROC entries to the fast-path arrays */
+ for (j = 0; j < procs_node; j++)
+ {
+ PGPROC *proc = ProcGlobal->allProcs[procs_total++];
+
+ /* cross-check we got PGPROC with no NUMA node assigned */
+ Assert(proc->numa_node == -1);
+ Assert(proc->procnumber == (procs_total - 1));
+
+ /*
+ * Set the fast-path lock arrays, and move the pointer. We
+ * interleave the two arrays, to (hopefully) get some locality for
+ * each backend.
+ */
+ proc->fpLockBits = (uint64 *) fpPtr;
+ fpPtr += fpLockBitsSize;
+
+ proc->fpRelId = (Oid *) fpPtr;
+ fpPtr += fpRelIdSize;
+
+ Assert(fpPtr <= fpEndPtr);
+ }
+ }
+
+ /* Should not have consumed more than the allocated fast-path memory. */
+ Assert(fpPtr <= fpEndPtr);
+
+ /* make sure we allocated the expected number of PGPROC entries */
+ Assert(procs_total == TotalProcs);
+
+ for (i = 0; i < TotalProcs; i++)
+ {
+ PGPROC *proc = procs[i];
+
+ Assert(proc->procnumber == i);
/*
* Set up per-PGPROC semaphore, latch, and fpInfoLock. Prepared xact
@@ -366,15 +602,12 @@ InitProcGlobal(void)
pg_atomic_init_u64(&(proc->waitStart), 0);
}
- /* Should have consumed exactly the expected amount of fast-path memory. */
- Assert(fpPtr == fpEndPtr);
-
/*
* Save pointers to the blocks of PGPROC structures reserved for auxiliary
* processes and prepared transactions.
*/
- AuxiliaryProcs = &procs[MaxBackends];
- PreparedXactProcs = &procs[MaxBackends + NUM_AUXILIARY_PROCS];
+ AuxiliaryProcs = procs[MaxBackends];
+ PreparedXactProcs = procs[MaxBackends + NUM_AUXILIARY_PROCS];
/* Create ProcStructLock spinlock, too */
ProcStructLock = (slock_t *) ShmemInitStruct("ProcStructLock spinlock",
@@ -435,7 +668,45 @@ InitProcess(void)
if (!dlist_is_empty(procgloballist))
{
- MyProc = dlist_container(PGPROC, links, dlist_pop_head_node(procgloballist));
+ /*
+ * With numa interleaving of PGPROC, try to get a PROC entry from the
+ * right NUMA node (when the process starts).
+ *
+ * XXX The process may move to a different NUMA node later, but
+ * there's not much we can do about that.
+ */
+ if (numa_procs_interleave)
+ {
+ dlist_mutable_iter iter;
+ unsigned cpu;
+ unsigned node;
+ int rc;
+
+ rc = getcpu(&cpu, &node);
+ if (rc != 0)
+ elog(ERROR, "getcpu failed: %m");
+
+ MyProc = NULL;
+
+ dlist_foreach_modify(iter, procgloballist)
+ {
+ PGPROC *proc;
+
+ proc = dlist_container(PGPROC, links, iter.cur);
+
+ if (proc->numa_node == node)
+ {
+ MyProc = proc;
+ dlist_delete(iter.cur);
+ break;
+ }
+ }
+ }
+
+ /* didn't find PGPROC from the correct NUMA node, pick any free one */
+ if (MyProc == NULL)
+ MyProc = dlist_container(PGPROC, links, dlist_pop_head_node(procgloballist));
+
SpinLockRelease(ProcStructLock);
}
else
@@ -1988,7 +2259,7 @@ ProcSendSignal(ProcNumber procNumber)
if (procNumber < 0 || procNumber >= ProcGlobal->allProcCount)
elog(ERROR, "procNumber out of range");
- SetLatch(&ProcGlobal->allProcs[procNumber].procLatch);
+ SetLatch(&ProcGlobal->allProcs[procNumber]->procLatch);
}
/*
@@ -2063,3 +2334,60 @@ BecomeLockGroupMember(PGPROC *leader, int pid)
return ok;
}
+
+/* copy from buf_init.c */
+static Size
+get_memory_page_size(void)
+{
+ Size os_page_size;
+ Size huge_page_size;
+
+#ifdef WIN32
+ SYSTEM_INFO sysinfo;
+
+ GetSystemInfo(&sysinfo);
+ os_page_size = sysinfo.dwPageSize;
+#else
+ os_page_size = sysconf(_SC_PAGESIZE);
+#endif
+
+ /*
+ * XXX This is a bit annoying/confusing, because we may get a different
+ * result depending on when we call it. Before mmap() we don't know if the
+ * huge pages get used, so we assume they will. And then if we don't get
+ * huge pages, we'll waste memory etc.
+ */
+
+ /* assume huge pages get used, unless HUGE_PAGES_OFF */
+ if (huge_pages_status == HUGE_PAGES_OFF)
+ huge_page_size = 0;
+ else
+ GetHugePageSize(&huge_page_size, NULL);
+
+ return Max(os_page_size, huge_page_size);
+}
+
+/*
+ * move_to_node
+ * move all pages in the given range to the requested NUMA node
+ *
+ * XXX This is expected to only process a fairly small number of pages, so no
+ * need to do batching etc. Just move pages one by one.
+ */
+static void
+move_to_node(char *startptr, char *endptr, Size mem_page_size, int node)
+{
+ while (startptr < endptr)
+ {
+ int r,
+ status;
+
+ r = numa_move_pages(0, 1, (void **) &startptr, &node, &status, 0);
+
+ if (r != 0)
+ elog(WARNING, "failed to move page to NUMA node %d (r = %d, status = %d)",
+ node, r, status);
+
+ startptr += mem_page_size;
+ }
+}
diff --git a/src/backend/utils/init/globals.c b/src/backend/utils/init/globals.c
index 7febf3001a3..bf775c76545 100644
--- a/src/backend/utils/init/globals.c
+++ b/src/backend/utils/init/globals.c
@@ -149,6 +149,7 @@ int MaxBackends = 0;
bool numa_buffers_interleave = false;
bool numa_localalloc = false;
int numa_partition_freelist = 0;
+bool numa_procs_interleave = false;
/* GUC parameters for vacuum */
int VacuumBufferUsageLimit = 2048;
diff --git a/src/backend/utils/misc/guc_tables.c b/src/backend/utils/misc/guc_tables.c
index e2361c161e6..930082588f2 100644
--- a/src/backend/utils/misc/guc_tables.c
+++ b/src/backend/utils/misc/guc_tables.c
@@ -2144,6 +2144,16 @@ struct config_bool ConfigureNamesBool[] =
NULL, NULL, NULL
},
+ {
+ {"numa_procs_interleave", PGC_POSTMASTER, DEVELOPER_OPTIONS,
+ gettext_noop("Enables NUMA interleaving of PGPROC entries."),
+ gettext_noop("When enabled, the PGPROC entries are interleaved to all NUMA nodes."),
+ },
+ &numa_procs_interleave,
+ false,
+ NULL, NULL, NULL
+ },
+
{
{"sync_replication_slots", PGC_SIGHUP, REPLICATION_STANDBY,
gettext_noop("Enables a physical standby to synchronize logical failover replication slots from the primary server."),
diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h
index 17528439f07..f454b4e9d75 100644
--- a/src/include/miscadmin.h
+++ b/src/include/miscadmin.h
@@ -181,6 +181,7 @@ extern PGDLLIMPORT int max_parallel_workers;
extern PGDLLIMPORT bool numa_buffers_interleave;
extern PGDLLIMPORT bool numa_localalloc;
extern PGDLLIMPORT int numa_partition_freelist;
+extern PGDLLIMPORT bool numa_procs_interleave;
extern PGDLLIMPORT int commit_timestamp_buffers;
extern PGDLLIMPORT int multixact_member_buffers;
diff --git a/src/include/storage/proc.h b/src/include/storage/proc.h
index 9f9b3fcfbf1..5cb1632718e 100644
--- a/src/include/storage/proc.h
+++ b/src/include/storage/proc.h
@@ -194,6 +194,8 @@ struct PGPROC
* vacuum must not remove tuples deleted by
* xid >= xmin ! */
+ int procnumber; /* index in ProcGlobal->allProcs */
+
int pid; /* Backend's process ID; 0 if prepared xact */
int pgxactoff; /* offset into various ProcGlobal->arrays with
@@ -319,6 +321,9 @@ struct PGPROC
PGPROC *lockGroupLeader; /* lock group leader, if I'm a member */
dlist_head lockGroupMembers; /* list of members, if I'm a leader */
dlist_node lockGroupLink; /* my member link, if I'm a member */
+
+ /* NUMA node */
+ int numa_node;
};
/* NOTE: "typedef struct PGPROC PGPROC" appears in storage/lock.h. */
@@ -383,7 +388,7 @@ extern PGDLLIMPORT PGPROC *MyProc;
typedef struct PROC_HDR
{
/* Array of PGPROC structures (not including dummies for prepared txns) */
- PGPROC *allProcs;
+ PGPROC **allProcs;
/* Array mirroring PGPROC.xid for each PGPROC currently in the procarray */
TransactionId *xids;
@@ -435,8 +440,8 @@ extern PGDLLIMPORT PGPROC *PreparedXactProcs;
/*
* Accessors for getting PGPROC given a ProcNumber and vice versa.
*/
-#define GetPGProcByNumber(n) (&ProcGlobal->allProcs[(n)])
-#define GetNumberFromPGProc(proc) ((proc) - &ProcGlobal->allProcs[0])
+#define GetPGProcByNumber(n) (ProcGlobal->allProcs[(n)])
+#define GetNumberFromPGProc(proc) ((proc)->procnumber)
/*
* We set aside some extra PGPROC structures for "special worker" processes,
--
2.49.0
v1-0006-NUMA-pin-backends-to-NUMA-nodes.patch
From f76377a56f37421c61c4dd876813b57084b019df Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tomas@vondra.me>
Date: Tue, 27 May 2025 23:08:48 +0200
Subject: [PATCH v1 6/6] NUMA: pin backends to NUMA nodes
When initializing the backend, we pick a PGPROC entry from the right
NUMA node where the backend is running. But the process can move to a
different core / node, so to prevent that we pin it.
---
src/backend/storage/lmgr/proc.c | 21 +++++++++++++++++++++
src/backend/utils/init/globals.c | 1 +
src/backend/utils/misc/guc_tables.c | 10 ++++++++++
src/include/miscadmin.h | 1 +
4 files changed, 33 insertions(+)
diff --git a/src/backend/storage/lmgr/proc.c b/src/backend/storage/lmgr/proc.c
index 9d3e94a7b3a..4c9e55608b2 100644
--- a/src/backend/storage/lmgr/proc.c
+++ b/src/backend/storage/lmgr/proc.c
@@ -729,6 +729,27 @@ InitProcess(void)
}
MyProcNumber = GetNumberFromPGProc(MyProc);
+ /*
+ * Optionally, restrict the process to only run on CPUs from the same NUMA
+ * node as the PGPROC. We do this even if the PGPROC came from a different
+ * node than the one the process currently runs on, but not for PGPROC
+ * entries without a node (i.e. aux/2PC entries).
+ *
+ * This also means we only do this with numa_procs_interleave, because
+ * without that we'll have numa_node=-1 for all PGPROC entries.
+ *
+ * FIXME add proper error-checking for libnuma functions
+ */
+ if (numa_procs_pin && MyProc->numa_node != -1)
+ {
+ struct bitmask *cpumask = numa_allocate_cpumask();
+
+ numa_node_to_cpus(MyProc->numa_node, cpumask);
+
+ numa_sched_setaffinity(MyProcPid, cpumask);
+
+ numa_free_cpumask(cpumask);
+ }
+
/*
* Cross-check that the PGPROC is of the type we expect; if this were not
* the case, it would get returned to the wrong list.
diff --git a/src/backend/utils/init/globals.c b/src/backend/utils/init/globals.c
index bf775c76545..e584ba840ef 100644
--- a/src/backend/utils/init/globals.c
+++ b/src/backend/utils/init/globals.c
@@ -150,6 +150,7 @@ bool numa_buffers_interleave = false;
bool numa_localalloc = false;
int numa_partition_freelist = 0;
bool numa_procs_interleave = false;
+bool numa_procs_pin = false;
/* GUC parameters for vacuum */
int VacuumBufferUsageLimit = 2048;
diff --git a/src/backend/utils/misc/guc_tables.c b/src/backend/utils/misc/guc_tables.c
index 930082588f2..3fc8897ae36 100644
--- a/src/backend/utils/misc/guc_tables.c
+++ b/src/backend/utils/misc/guc_tables.c
@@ -2154,6 +2154,16 @@ struct config_bool ConfigureNamesBool[] =
NULL, NULL, NULL
},
+ {
+ {"numa_procs_pin", PGC_POSTMASTER, DEVELOPER_OPTIONS,
+ gettext_noop("Enables pinning backends to NUMA nodes (matching the PGPROC node)."),
+ gettext_noop("When enabled, sets affinity to CPUs from the same NUMA node."),
+ },
+ &numa_procs_pin,
+ false,
+ NULL, NULL, NULL
+ },
+
{
{"sync_replication_slots", PGC_SIGHUP, REPLICATION_STANDBY,
gettext_noop("Enables a physical standby to synchronize logical failover replication slots from the primary server."),
diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h
index f454b4e9d75..d0d960caa9d 100644
--- a/src/include/miscadmin.h
+++ b/src/include/miscadmin.h
@@ -182,6 +182,7 @@ extern PGDLLIMPORT bool numa_buffers_interleave;
extern PGDLLIMPORT bool numa_localalloc;
extern PGDLLIMPORT int numa_partition_freelist;
extern PGDLLIMPORT bool numa_procs_interleave;
+extern PGDLLIMPORT bool numa_procs_pin;
extern PGDLLIMPORT int commit_timestamp_buffers;
extern PGDLLIMPORT int multixact_member_buffers;
--
2.49.0
On Wed, Jul 2, 2025 at 12:37 AM Tomas Vondra <tomas@vondra.me> wrote:
3) v1-0003-freelist-Don-t-track-tail-of-a-freelist.patch
Minor optimization. Andres noticed we're tracking the tail of buffer
freelist, without using it. So the patch removes that.
The patches for resizing buffers use the lastFreeBuffer to add new
buffers to the end of free list when expanding it. But we could as
well add it at the beginning of the free list.
This patch seems almost independent of the rest of the patches. Do you
need it in the rest of the patches? I understand that those patches
don't need to worry about maintaining lastFreeBuffer after this patch.
Is there any other effect?
If we are going to do this, let's do it earlier so that buffer
resizing patches can be adjusted.
There's also the question how this is related to other patches affecting
shared memory - I think the most relevant one is the "shared buffers
online resize" by Ashutosh, simply because it touches the shared memory.
I have added Dmitry to this thread since he has written most of the
shared memory handling code.
I don't think the splitting would actually make some things simpler, or
maybe more flexible - in particular, it'd allow us to enable huge pages
only for some regions (like shared buffers), and keep the small pages
e.g. for PGPROC. So that'd be good.
The resizing patches split the shared buffer related structures into
separate memory segments. I think that itself will help enabling huge
pages for some regions. Would that help in your case?
But there'd also need to be some logic to "rework" how shared buffers
get mapped to NUMA nodes after resizing. It'd be silly to start with
memory on 4 nodes (25% each), resize shared buffers to 50% and end up
with memory only on 2 of the nodes (because the other 2 nodes were
originally assigned the upper half of shared buffers).
I don't have a clear idea how this would be done, but I guess it'd
require a bit of code invoked sometime after the resize. It'd already
need to rebuild the freelists in some way, I guess.
Yes, there's code to build the free list. I think we will need code to
remap the buffers and buffer descriptor.
--
Best Wishes,
Ashutosh Bapat
On 7/2/25 13:37, Ashutosh Bapat wrote:
On Wed, Jul 2, 2025 at 12:37 AM Tomas Vondra <tomas@vondra.me> wrote:
3) v1-0003-freelist-Don-t-track-tail-of-a-freelist.patch
Minor optimization. Andres noticed we're tracking the tail of buffer
freelist, without using it. So the patch removes that.
The patches for resizing buffers use the lastFreeBuffer to add new
buffers to the end of free list when expanding it. But we could as
well add it at the beginning of the free list.
This patch seems almost independent of the rest of the patches. Do you
need it in the rest of the patches? I understand that those patches
don't need to worry about maintaining lastFreeBuffer after this patch.
Is there any other effect?
If we are going to do this, let's do it earlier so that buffer
resizing patches can be adjusted.
My patches don't particularly rely on this bit, it would work even with
lastFreeBuffer. I believe Andres simply noticed the current code does
not use lastFreeBuffer, it just maintains it, so he removed that as an
optimization. I don't know how significant is the improvement, but if
it's measurable we could just do that independently of our patches.
There's also the question how this is related to other patches affecting
shared memory - I think the most relevant one is the "shared buffers
online resize" by Ashutosh, simply because it touches the shared memory.I have added Dmitry to this thread since he has written most of the
shared memory handling code.
Thanks.
I don't think the splitting would actually make some things simpler, or
maybe more flexible - in particular, it'd allow us to enable huge pages
only for some regions (like shared buffers), and keep the small pages
e.g. for PGPROC. So that'd be good.
The resizing patches split the shared buffer related structures into
separate memory segments. I think that itself will help enabling huge
pages for some regions. Would that help in your case?
Indirectly. My patch can work just fine with a single segment, but being
able to enable huge pages only for some of the segments seems better.
But there'd also need to be some logic to "rework" how shared buffers
get mapped to NUMA nodes after resizing. It'd be silly to start with
memory on 4 nodes (25% each), resize shared buffers to 50% and end up
with memory only on 2 of the nodes (because the other 2 nodes were
originally assigned the upper half of shared buffers).
I don't have a clear idea how this would be done, but I guess it'd
require a bit of code invoked sometime after the resize. It'd already
need to rebuild the freelists in some way, I guess.
Yes, there's code to build the free list. I think we will need code to
remap the buffers and buffer descriptor.
Right. The good thing is that's just "advisory" information, it doesn't
break anything if it's temporarily out of sync. We don't need to "stop"
everything to remap the buffers to other nodes, or anything like that.
Or at least I think so.
It's one thing to "flip" the target mapping (determining which node a
buffer should be on), and actually migrating the buffers. The first part
can be done instantaneously, the second part can happen in the
background over a longer time period.
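Just to illustrate what I mean, a minimal sketch of such a background pass,
assuming the buffer-to-node mapping from 0001 has already been "flipped";
mem_page_size and the warning text are illustrative, numa_move_pages() is the
same libnuma call the PGPROC patch already uses:

    /* migrate shared buffers to their (new) target nodes, one page at a time */
    Size  npages = ((Size) NBuffers * BLCKSZ) / mem_page_size;

    for (Size p = 0; p < npages; p++)
    {
        void   *page = BufferBlocks + p * mem_page_size;
        int     node = BufferGetNode((p * mem_page_size) / BLCKSZ);
        int     status;

        if (numa_move_pages(0, 1, &page, &node, &status, 0) != 0)
            elog(WARNING, "failed to migrate shared buffers page %zu", (size_t) p);

        CHECK_FOR_INTERRUPTS();     /* can be spread over a long period */
    }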
I'm not sure how you're rebuilding the freelist. Presumably it can
contain buffers that are no longer valid (after shrinking). How is that
handled to not break anything? I think the NUMA variant would do exactly
the same thing, except that there's multiple lists.
regards
--
Tomas Vondra
On Wed, Jul 2, 2025 at 6:06 PM Tomas Vondra <tomas@vondra.me> wrote:
I'm not sure how you're rebuilding the freelist. Presumably it can
contain buffers that are no longer valid (after shrinking). How is that
handled to not break anything? I think the NUMA variant would do exactly
the same thing, except that there's multiple lists.
Before shrinking the buffers, we walk the free list removing any
buffers that are going to be removed. When expanding, by linking the
new buffers in the order and then adding those to the already existing
free list. 0005 patch in [1]/messages/by-id/my4hukmejato53ef465ev7lk3sqiqvneh7436rz64wmtc7rbfj@hmuxsf2ngov2 has the code for the same.
[1]: /messages/by-id/my4hukmejato53ef465ev7lk3sqiqvneh7436rz64wmtc7rbfj@hmuxsf2ngov2
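Roughly, the pruning walk looks like the sketch below (illustrative only,
assuming the single pre-NUMA freelist and a new, smaller buffer count
NBuffersNew; the real code is in the patch referenced above):

    /* unlink buffers that fall beyond the new (smaller) pool size */
    int *prev = &StrategyControl->firstFreeBuffer;

    while (*prev >= 0)
    {
        BufferDesc *buf = GetBufferDescriptor(*prev);

        if (*prev >= NBuffersNew)
            *prev = buf->freeNext;      /* drop this buffer from the list */
        else
            prev = &buf->freeNext;      /* keep it, move on */
    }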
--
Best Wishes,
Ashutosh Bapat
On Wed, Jul 02, 2025 at 05:07:28PM +0530, Ashutosh Bapat wrote:
There's also the question how this is related to other patches affecting
shared memory - I think the most relevant one is the "shared buffers
online resize" by Ashutosh, simply because it touches the shared memory.I have added Dmitry to this thread since he has written most of the
shared memory handling code.
Thanks! I like the idea behind this patch series. I haven't read it in
details yet, but I can imagine both patches (interleaving and online
resizing) could benefit from each other. In online resizing we've
introduced a possibility to use multiple shared mappings for different
types of data, maybe it would be convenient to use the same interface to
create separate mappings for different NUMA nodes as well. Using a
separate shared mapping per NUMA node would also make resizing easier,
since it would be more straightforward to fit an increased segment into
NUMA boundaries.
I don't think the splitting would actually make some things simpler, or
maybe more flexible - in particular, it'd allow us to enable huge pages
only for some regions (like shared buffers), and keep the small pages
e.g. for PGPROC. So that'd be good.
The resizing patches split the shared buffer related structures into
separate memory segments. I think that itself will help enabling huge
pages for some regions. Would that help in your case?
Right, separate segments would allow to mix and match huge pages with
pages of regular size. It's not implemented in the latest version of
online resizing patch, purely to reduce complexity and maintain the same
invariant (everything is either using huge pages or not) -- but we could
do it other way around as well.
On Tue, Jul 1, 2025 at 9:07 PM Tomas Vondra <tomas@vondra.me> wrote:
Hi!
1) v1-0001-NUMA-interleaving-buffers.patch
[..]
It's a bit more complicated, because the patch distributes both the
blocks and descriptors, in the same way. So a buffer and it's descriptor
always end on the same NUMA node. This is one of the reasons why we need
to map larger chunks, because NUMA works on page granularity, and the
descriptors are tiny - many fit on a memory page.
Oh, now I get it! OK, let's stick to this one.
I don't think the splitting would actually make some things simpler, or
maybe more flexible - in particular, it'd allow us to enable huge pages
only for some regions (like shared buffers), and keep the small pages
e.g. for PGPROC. So that'd be good.
You have made the assumption that this is good, but small pages (4KB) are
not hugetlb, and are *swappable* (Transparent HP are swappable too,
manually allocated ones as with mmap(MAP_HUGETLB) are not)[1]https://www.kernel.org/doc/Documentation/vm/hugetlbpage.txt. The
most frequent problem I see these days is OOMs, and it makes me
believe that making certain critical parts of shared memory
swappable just to get page-size granularity is possibly throwing the baby
out with the bathwater. I'm thinking about bad situations like: some
wrong settings of vm.swappiness that people keep (or distros keep?) and
the general inability of PG to restrain from allocating more memory in
some cases.
The other thing I haven't thought about very much is determining on
which CPUs/nodes the instance is allowed to run. I assume we'd start by
simply inherit/determine that at the start through libnuma, not through
some custom PG configuration (which the patch [2] proposed to do).
0. I think that we could do better, some counter arguments to
no-configuration-at-all:
a. as Robert & Bertrand already put it there after review: let's say I
want just to run on NUMA #2 node, so here I would need to override
systemd's script ExecStart= to include that numactl (not elegant?). I
could also use `CPUAffinity=1,3,5,7..` but that's all, and it is even
less friendly. Also it probably requires root to edit/reload systemd,
while having GUC for this like in my proposal makes it more smooth (I
think?)
b. wouldn't it be better if that stayed as a drop-in rather than always
on? What if there's a problem - how do you disable those internal
optimizations if they do harm in some cases (or let's say I want to
play with MPOL_INTERLEAVE_WEIGHTED)? So at least a boolean
numa_buffers_interleave would be nice?
c. What if I want my standby (walreceiver+startup/recovery) to run
with NUMA affinity to get better performance? (I'm not going to hack
around the systemd script every time, but I could imagine changing
numa=X,Y,Z after restart/before promotion.)
d. Now if I were forced for some reason to do that numactl(1)
voodoo, and use those above-mentioned overrides because PG wouldn't
have a GUC (let's say I would use `numactl
--weighted-interleave=0,1`), then:
2) v1-0002-NUMA-localalloc.patch
This simply sets "localalloc" when initializing a backend, so that all
memory allocated later is local, not interleaved. Initially this was
necessary because the patch set the allocation policy to interleaving
before initializing shared memory, and we didn't want to interleave the
private memory. But that's no longer the case - the explicit mapping to
nodes does not have this issue. I'm keeping the patch for convenience,
it allows experimenting with numactl etc.
... is not accurate anymore, so we would still need to have that in
(still with a GUC)?
Thoughts? I can add my part to your patches if you want.
A way too quick review and some very fast benchmark probes; I've
concentrated only on v1-0001 and v1-0005 (the efficiency of buffermgmt
would be too new a topic for me), but let's start:
1. normal pgbench -S (still with just s_b@4GB), done many tries,
consistent benefit for the patch with like +8..10% boost on generic
run:
numa_buffers_interleave=off numa_pgproc_interleave=on (due to that
always-on "if"), s_b just on 1 NUMA node (might happen)
latency average = 0.373 ms
latency stddev = 0.237 ms
initial connection time = 45.899 ms
tps = 160242.147877 (without initial connection time)
numa_buffers_interleave=on numa_pgproc_interleave=on
latency average = 0.345 ms
latency stddev = 0.373 ms
initial connection time = 44.485 ms
tps = 177564.686094 (without initial connection time)
2. Tested it the same way as I did for mine (problem #2 from Andres's
presentation): 4s32c128t, s_b=4GB (on 128GB), prewarm test (with
seqconcurrscans.pgb as earlier)
default/numa_buffers_interleave=off
latency average = 1375.478 ms
latency stddev = 1141.423 ms
initial connection time = 46.104 ms
tps = 45.868075 (without initial connection time)
numa_buffers_interleave=on
latency average = 838.128 ms
latency stddev = 498.787 ms
initial connection time = 43.437 ms
tps = 75.413894 (without initial connection time)
and I've repeated the same test (identical conditions) with my
patch, which got me slightly more juice:
latency average = 727.717 ms
latency stddev = 410.767 ms
initial connection time = 45.119 ms
tps = 86.844161 (without initial connection time)
(but mine didn't get that boost from normal pgbench as per #1
pgbench -S -- my numa='all' stays @ 160k TPS just as
numa_buffers_interleave=off), so this idea is clearly better.
So should I close https://commitfest.postgresql.org/patch/5703/
and you'll open a new one or should I just edit the #5703 and alter it
and add this thread too?
3. The patch is not calling interleave on PQ shmem; do we want to add that
in as some next item, like v1-0007? The question is whether OS interleaving
makes sense there. I believe it does - please see my thread
(NUMA_pq_cpu_pinning_results.txt); the issue is that PQ workers are
spawned by the postmaster and may end up on different NUMA nodes
randomly, so OS-interleaving that memory actually reduces jitter there
(AKA bandwidth-over-latency). My thinking is that one cannot expect a
static/forced CPU-to-just-one-NUMA-node assignment for a backend and
its PQ workers, because it is impossible to always have CPU
power available on that NUMA node, so it might be useful to interleave
that shared mem there too (as a separate patch item? see the sketch below)
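Conceptually it would be as small as this sketch, run right after the
parallel-query DSM segment is created (the helper name is made up;
numa_interleave_memory() and numa_all_nodes_ptr are standard libnuma,
dsm_segment_address() is the existing DSM API):

    #ifdef USE_LIBNUMA
    /* spread a freshly created parallel-query DSM segment across all nodes */
    static void
    interleave_pq_segment(dsm_segment *seg, Size size)
    {
        void   *addr = dsm_segment_address(seg);

        numa_interleave_memory(addr, size, numa_all_nodes_ptr);
    }
    #endif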
4. In BufferManagerShmemInit() you call numa_num_configured_nodes()
(also in v1-0005). My worry is: should we put some
known-limitations docs (?) in from the start and mention that
if the VM is greatly resized and new NUMA nodes appear, they might
not be used until restart?
5. In v1-0001, pg_numa_interleave_memory()
+ * XXX no return value, to make this fail on error, has to use
+ * numa_set_strict
Yes, my patch has those numa_error() and numa_warn() handlers too in
pg_numa. Feel free to use them for better UX.
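For reference, libnuma lets the application override its numa_warn() /
numa_error() hooks; a minimal sketch of routing them through elog (not
necessarily how the pg_numa patch does it):

    #include <stdarg.h>

    /* libnuma calls these hooks instead of printing to stderr / exiting */
    void
    numa_warn(int num, char *fmt,...)
    {
        char        buf[1024];
        va_list     ap;

        va_start(ap, fmt);
        vsnprintf(buf, sizeof(buf), fmt, ap);
        va_end(ap);

        elog(WARNING, "libnuma warning (%d): %s", num, buf);
    }

    void
    numa_error(char *where)
    {
        /* errno is set by the failed libnuma call */
        elog(WARNING, "libnuma error in %s: %m", where);
    }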
+ * XXX Should we still touch the memory first, like
with numa_move_pages,
+ * or is that not necessary?
It's not necessary to touch the memory after numa_tonode_memory() (wrapper
around numa_interleave_memory()); if it is going to be used anyway it will
be correctly placed, to the best of my knowledge.
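For reference, a minimal sketch of what such handlers could look like -
libnuma lets a program override numa_error() and numa_warn(), and the
elog levels and wording below are just placeholders, not taken from my
patch:

#include "postgres.h"

#include <stdarg.h>
#include <numa.h>

/* Route libnuma warnings through ereport() instead of stderr. */
void
numa_warn(int num, char *fmt,...)
{
    char        buf[1024];
    va_list     ap;

    va_start(ap, fmt);
    vsnprintf(buf, sizeof(buf), fmt, ap);
    va_end(ap);

    ereport(WARNING,
            (errmsg_internal("libnuma (%d): %s", num, buf)));
}

/* libnuma calls this on failures; errno is set by the failing call. */
void
numa_error(char *where)
{
    ereport(ERROR,
            (errmsg_internal("libnuma: %s failed: %m", where)));
}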
6. diff --git a/src/bin/pgbench/pgbench.c b/src/bin/pgbench/pgbench.c
Accidental indents (also fails to apply)
7. We're missing the pg_numa_* shims, but that's surely for later - and it
would also let us avoid the Linux-specific #ifdef USE_LIBNUMA and so on?
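Something along these lines, perhaps (just a hypothetical sketch - the
function name is mine, not from the patch set; the point is that callers
could use the pg_numa_* API unconditionally and a non-libnuma build would
simply behave like a single-node system):

#ifdef USE_LIBNUMA
#include <numa.h>

int
pg_numa_nodes(void)
{
    /* treat "libnuma unavailable" the same as a single node */
    return (numa_available() < 0) ? 1 : numa_num_configured_nodes();
}
#else
int
pg_numa_nodes(void)
{
    return 1;                   /* no NUMA support built in */
}
#endif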
8. v1-0005 has 2x "+ /* if (numa_procs_interleave) */"
Ha! It's a trap! I've uncommented it because I wanted to try it out
without it (just by setting the GUC off), but "MyProc->sem" is NULL:
2025-07-04 12:31:08.103 CEST [28754] LOG: starting PostgreSQL
19devel on x86_64-linux, compiled by gcc-12.2.0, 64-bit
[..]
2025-07-04 12:31:08.109 CEST [28754] LOG: io worker (PID 28755)
was terminated by signal 11: Segmentation fault
2025-07-04 12:31:08.109 CEST [28754] LOG: terminating any other
active server processes
2025-07-04 12:31:08.114 CEST [28754] LOG: shutting down because
"restart_after_crash" is off
2025-07-04 12:31:08.116 CEST [28754] LOG: database system is shut down
[New LWP 28755]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
Core was generated by `postgres: io worker '.
Program terminated with signal SIGSEGV, Segmentation fault.
#0 __new_sem_wait_fast (definitive_result=1, sem=sem@entry=0x0)
at ./nptl/sem_waitcommon.c:136
136 ./nptl/sem_waitcommon.c: No such file or directory.
(gdb) where
#0 __new_sem_wait_fast (definitive_result=1, sem=sem@entry=0x0)
at ./nptl/sem_waitcommon.c:136
#1 __new_sem_trywait (sem=sem@entry=0x0) at ./nptl/sem_wait.c:81
#2 0x00005561918e0cac in PGSemaphoreReset (sema=0x0) at
../src/backend/port/posix_sema.c:302
#3 0x0000556191970553 in InitAuxiliaryProcess () at
../src/backend/storage/lmgr/proc.c:992
#4 0x00005561918e51a2 in AuxiliaryProcessMainCommon () at
../src/backend/postmaster/auxprocess.c:65
#5 0x0000556191940676 in IoWorkerMain (startup_data=<optimized
out>, startup_data_len=<optimized out>) at
../src/backend/storage/aio/method_worker.c:393
#6 0x00005561918e8163 in postmaster_child_launch
(child_type=child_type@entry=B_IO_WORKER, child_slot=20086,
startup_data=startup_data@entry=0x0,
startup_data_len=startup_data_len@entry=0,
client_sock=client_sock@entry=0x0) at
../src/backend/postmaster/launch_backend.c:290
#7 0x00005561918ea09a in StartChildProcess
(type=type@entry=B_IO_WORKER) at
../src/backend/postmaster/postmaster.c:3973
#8 0x00005561918ea308 in maybe_adjust_io_workers () at
../src/backend/postmaster/postmaster.c:4404
[..]
(gdb) print *MyProc->sem
Cannot access memory at address 0x0
9. v1-0006: is this just a thought or a serious candidate? I can imagine
it could easily blow up, with some backends somehow requesting CPUs only
from one NUMA node while the second node sits idle. Isn't it better
just to leave CPU scheduling, well, to the CPU scheduler? The problem
is that you have tools showing overall CPU usage, even mpstat(1) per
CPU, but no tools for per-NUMA-node CPU util%, so it would be hard
for someone to realize that this is happening.
-J.
[1]: https://www.kernel.org/doc/Documentation/vm/hugetlbpage.txt
On 7/4/25 13:05, Jakub Wartak wrote:
On Tue, Jul 1, 2025 at 9:07 PM Tomas Vondra <tomas@vondra.me> wrote:
Hi!
1) v1-0001-NUMA-interleaving-buffers.patch
[..]
It's a bit more complicated, because the patch distributes both the
blocks and descriptors, in the same way. So a buffer and it's descriptor
always end on the same NUMA node. This is one of the reasons why we need
to map larger chunks, because NUMA works on page granularity, and the
descriptors are tiny - many fit on a memory page.
Oh, now I get it! OK, let's stick to this one.
I don't think the splitting would actually make some things simpler, or
maybe more flexible - in particular, it'd allow us to enable huge pages
only for some regions (like shared buffers), and keep the small pages
e.g. for PGPROC. So that'd be good.
You have made assumption that this is good, but small pages (4KB) are
not hugetlb, and are *swappable* (Transparent HP are swappable too,
manually allocated ones as with mmap(MMAP_HUGETLB) are not)[1]. The
most frequent problem I see these days are OOMs, and it makes me
believe that making certain critical parts of shared memory being
swappable just to make pagesize granular is possibly throwing the baby
out with the bathwater. I'm thinking about bad situations like: some
wrong settings of vm.swapiness that people keep (or distros keep?) and
general inability of PG to restrain from allocating more memory in
some cases.
I haven't observed such issues myself, or maybe I didn't realize it's
happening. Maybe it happens, but it'd be good to see some data showing
that, or a reproducer of some sort. But let's say it's real.
I don't think we should use huge pages merely to ensure something is not
swapped out. The "not swappable" is more of a limitation of huge pages,
not an advantage. You can't just choose to make them swappable.
Wouldn't it be better to keep using 4KB pages, but lock the memory using
mlock/mlockall?
The other thing I haven't thought about very much is determining on
which CPUs/nodes the instance is allowed to run. I assume we'd start by
simply inherit/determine that at the start through libnuma, not through
some custom PG configuration (which the patch [2] proposed to do).
0. I think that we could do better, some counter arguments to
no-configuration-at-all:
a. as Robert & Bertrand already put it there after review: let's say I
want just to run on NUMA #2 node, so here I would need to override
systemd's script ExecStart= to include that numactl (not elegant?). I
could also use `CPUAffinity=1,3,5,7..` but that's all, and it is even
less friendly. Also it probably requires root to edit/reload systemd,
while having GUC for this like in my proposal makes it more smooth (I
think?)
b. wouldn't it be better if that stayed as drop-in rather than always
on? What if there's a problem, how do you disable those internal
optimizations if they do harm in some cases? (or let's say I want to
play with MPOL_INTERLEAVE_WEIGHTED?). So at least boolean
numa_buffers_interleave would be nice?
c. What if I want my standby (walreceiver+startup/recovery) to run
with NUMA affinity to get better performance (I'm not going to hack
around systemd script every time, but I could imagine changing
numa=X,Y,Z after restart/before promotion)
d. Now if I would be forced for some reason to do that numactl(1)
voodoo, and use the those above mentioned overrides and PG wouldn't be
having GUC (let's say I would use `numactl
--weighted-interleave=0,1`), then:
I'm not against doing something like this, but I don't plan to do that
in V1. I don't have a clear idea what configurability is actually
needed, so it's likely I'd do the interface wrong.
2) v1-0002-NUMA-localalloc.patch
This simply sets "localalloc" when initializing a backend, so that all
memory allocated later is local, not interleaved. Initially this was
necessary because the patch set the allocation policy to interleaving
before initializing shared memory, and we didn't want to interleave the
private memory. But that's no longer the case - the explicit mapping to
nodes does not have this issue. I'm keeping the patch for convenience,
it allows experimenting with numactl etc.
... is not accurate anymore and we would require to have that in
(still with GUC)?
Thoughts? I can add that mine part into Your's patches if you want.
I'm sorry, I don't understand what the question is :-(
Way too quick review and some very fast benchmark probes, I've
concentrated only on v1-0001 and v1-0005 (efficiency of buffermgmt
would be too new topic for me), but let's start:
1. normal pgbench -S (still with just s_b@4GB), done many tries,
consistent benefit for the patch with like +8..10% boost on generic
run:
numa_buffers_interleave=off numa_pgproc_interleave=on(due that
always on "if"), s_b just on 1 NUMA node (might happen)
latency average = 0.373 ms
latency stddev = 0.237 ms
initial connection time = 45.899 ms
tps = 160242.147877 (without initial connection time)
numa_buffers_interleave=on numa_pgproc_interleave=on
latency average = 0.345 ms
latency stddev = 0.373 ms
initial connection time = 44.485 ms
tps = 177564.686094 (without initial connection time)
2. Tested it the same way as I did for mine(problem#2 from Andres's
presentation): 4s32c128t, s_b=4GB (on 128GB), prewarm test (with
seqconcurrscans.pgb as earlier)
default/numa_buffers_interleave=off
latency average = 1375.478 ms
latency stddev = 1141.423 ms
initial connection time = 46.104 ms
tps = 45.868075 (without initial connection time)
numa_buffers_interleave=on
latency average = 838.128 ms
latency stddev = 498.787 ms
initial connection time = 43.437 ms
tps = 75.413894 (without initial connection time)
and I've repeated the same test (identical conditions) with my
patch, got me slightly more juice:
latency average = 727.717 ms
latency stddev = 410.767 ms
initial connection time = 45.119 ms
tps = 86.844161 (without initial connection time)
(but mine didn't get that boost from normal pgbench as per #1
pgbench -S -- my numa='all' stays @ 160k TPS just as
numa_buffers_interleave=off), so this idea is clearly better.
Good, thanks for the testing. I should have done something like this
when I posted my patches, but I forgot about that (and the email felt
too long anyway).
But this actually brings up an interesting question. What exactly should we
expect / demand from these patches? In my mind it's primarily about
predictability and stability of results.
For example, the results should not depend on how the database was
warmed up - was it done by a single backend or many backends? Was it
restarted, or what? I could probably warm up the system very carefully to
ensure it's balanced. The patches mean I don't need to be that careful.
So should I close https://commitfest.postgresql.org/patch/5703/
and you'll open a new one or should I just edit the #5703 and alter it
and add this thread too?
Good question. It's probably best to close the original entry as
"withdrawn" and I'll add a new entry. Sounds OK?
3. Patch is not calling interleave on PQ shmem, do we want to add that
in as some next item like v1-0007? Question is whether OS interleaving
makes sense there ? I believe it does there, please see my thread
(NUMA_pq_cpu_pinning_results.txt), the issue is that PQ workers are
being spawned by postmaster and may end up on different NUMA nodes
randomly, so actually OS-interleaving that memory reduces jitter there
(AKA bandwidth-over-latency). My thinking is that one cannot expect
static/forced CPU-to-just-one-NUMA-node assignment for backend and
it's PQ workers, because it is impossible have always available CPU
power there in that NUMA node, so it might be useful to interleave
that shared mem there too (as separate patch item?)
Excellent question. I haven't thought about this at all. I agree it
probably makes sense to interleave this memory, in some way. I don't
know what the perfect scheme is, though.
wild idea: Would it make sense to pin the workers to the same NUMA node
as the leader? And allocate all memory only from that node?
4 In BufferManagerShmemInit() you call numa_num_configured_nodes()
(also in v1-0005). My worry is should we may put some
known-limitations docs (?) from start and mention that
if the VM is greatly resized and NUMA numa nodes appear, they might
not be used until restart?
Yes, this is one thing I need some feedback on. The patches mostly
assume there are no disabled nodes, that the set of allowed nodes does
not change, etc. I think for V1 that's a reasonable limitation.
But let's say we want to relax this a bit. How do we learn about the
change, after a node/CPU gets disabled? For some parts it's not that
difficult (e.g. we can "remap" buffers/descriptors in the background).
But for other parts that's not practical. E.g. we can't rework how the
PGPROC gets split.
But while discussing this with Andres yesterday, he had an interesting
suggestion - to always use e.g. 8 or 16 partitions, then partition this
by NUMA node. So we'd have 16 partitions, and with 4 nodes the 0-3 would
go to node 0, 4-7 would go to node 1, etc. The advantage is that if a
node gets disabled, we can rebuild just this small "mapping" and not the
16 partitions. And the partitioning may be helpful even without NUMA.
Still have to figure out the details, but seems it might help.
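Just to make that mapping concrete, a toy sketch of the idea (the constant
and the names are mine, purely illustrative):

#define NUM_PARTITIONS  16

static int  partition_node[NUM_PARTITIONS];

/*
 * Map a fixed number of partitions onto however many NUMA nodes exist
 * right now. If the set of nodes changes, only this small map needs to
 * be rebuilt - the partitioning itself stays put.
 */
static void
rebuild_partition_map(int num_nodes)
{
    for (int i = 0; i < NUM_PARTITIONS; i++)
        partition_node[i] = (i * num_nodes) / NUM_PARTITIONS;
}

With 4 nodes that gives exactly the 0-3 -> node 0, 4-7 -> node 1 split
described above, and it still spreads partitions roughly evenly when the
node count doesn't divide 16.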
5. In v1-0001, pg_numa_interleave_memory()
+ * XXX no return value, to make this fail on error, has to use
+ * numa_set_strict
Yes, my patch has those numa_error() and numa_warn() handlers too in
pg_numa. Feel free to use it for better UX.
+ * XXX Should we still touch the memory first, like with numa_move_pages,
+ * or is that not necessary?
It's not necessary to touch after numa_tonode_memory() (wrapper around
numa_interleave_memory()), if it is going to be used anyway it will be
correctly placed to best of my knowledge.
6. diff --git a/src/bin/pgbench/pgbench.c b/src/bin/pgbench/pgbench.c
Accidental indents (also fails to apply)
7. We miss the pg_numa_* shims, but for sure that's for later and also
avoid those Linux specific #ifdef USE_LIBNUMA and so on?
Right, we need to add those. Or actually, we need to think about how
we'd do this for non-NUMA systems. I wonder if we even want to just
build everything the "old way" (without the partitions, etc.).
But per the earlier comment, the partitioning seems beneficial even on
non-NUMA systems, so maybe the shims are good enough.
8. v1-0005 2x + /* if (numa_procs_interleave) */
Ha! it's a TRAP! I've uncommented it because I wanted to try it out
without it (just by setting GUC off) , but "MyProc->sema" is NULL :
2025-07-04 12:31:08.103 CEST [28754] LOG: starting PostgreSQL
19devel on x86_64-linux, compiled by gcc-12.2.0, 64-bit
[..]
2025-07-04 12:31:08.109 CEST [28754] LOG: io worker (PID 28755)
was terminated by signal 11: Segmentation fault
2025-07-04 12:31:08.109 CEST [28754] LOG: terminating any other
active server processes
2025-07-04 12:31:08.114 CEST [28754] LOG: shutting down because
"restart_after_crash" is off
2025-07-04 12:31:08.116 CEST [28754] LOG: database system is shut down
[New LWP 28755]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
Core was generated by `postgres: io worker '.
Program terminated with signal SIGSEGV, Segmentation fault.
#0 __new_sem_wait_fast (definitive_result=1, sem=sem@entry=0x0)
at ./nptl/sem_waitcommon.c:136
136 ./nptl/sem_waitcommon.c: No such file or directory.
(gdb) where
#0 __new_sem_wait_fast (definitive_result=1, sem=sem@entry=0x0)
at ./nptl/sem_waitcommon.c:136
#1 __new_sem_trywait (sem=sem@entry=0x0) at ./nptl/sem_wait.c:81
#2 0x00005561918e0cac in PGSemaphoreReset (sema=0x0) at
../src/backend/port/posix_sema.c:302
#3 0x0000556191970553 in InitAuxiliaryProcess () at
../src/backend/storage/lmgr/proc.c:992
#4 0x00005561918e51a2 in AuxiliaryProcessMainCommon () at
../src/backend/postmaster/auxprocess.c:65
#5 0x0000556191940676 in IoWorkerMain (startup_data=<optimized
out>, startup_data_len=<optimized out>) at
../src/backend/storage/aio/method_worker.c:393
#6 0x00005561918e8163 in postmaster_child_launch
(child_type=child_type@entry=B_IO_WORKER, child_slot=20086,
startup_data=startup_data@entry=0x0,
startup_data_len=startup_data_len@entry=0,
client_sock=client_sock@entry=0x0) at
../src/backend/postmaster/launch_backend.c:290
#7 0x00005561918ea09a in StartChildProcess
(type=type@entry=B_IO_WORKER) at
../src/backend/postmaster/postmaster.c:3973
#8 0x00005561918ea308 in maybe_adjust_io_workers () at
../src/backend/postmaster/postmaster.c:4404
[..]
(gdb) print *MyProc->sem
Cannot access memory at address 0x0
Yeah, good catch. I'll look into that next week.
9. v1-0006: is this just a thought or serious candidate? I can imagine
it can easily blow-up with some backends somehow requesting CPUs only
from one NUMA node, while the second node being idle. Isn't it better
just to leave CPU scheduling, well, to the CPU scheduler? The problem
is that you have tools showing overall CPU usage, even mpstat(1) per
CPU , but no tools for per-NUMA node CPU util%, so it would be hard
for someone to realize that this is happening.
Mostly experimental, for benchmarking etc. I agree we may not want to
mess with the task scheduling too much.
Thanks for the feedback!
regards
--
Tomas Vondra
Hi Tomas,
I haven't yet had time to fully read all the work and proposals around
NUMA and related features, but I hope to catch up over the summer.
However, I think it's important to share some thoughts before it's too
late, as you might find them relevant to the NUMA management code.
6) v1-0006-NUMA-pin-backends-to-NUMA-nodes.patch
This is an experimental patch, that simply pins the new process to the
NUMA node obtained from the freelist.
Driven by GUC "numa_procs_pin" (default: off).
In my work on more careful PostgreSQL resource management, I've come to
the conclusion that we should avoid pushing policy too deeply into the
PostgreSQL core itself. Therefore, I'm quite skeptical about integrating
NUMA-specific management directly into core PostgreSQL in such a way.
We are working on a PROFILE and PROFILE MANAGER specification to provide
PostgreSQL with only the APIs and hooks needed so that extensions can
manage whatever they want externally.
The basic syntax (not meant to be discussed here, and even the names
might change) is roughly as follows, just to illustrate the intent:
CREATE PROFILE MANAGER manager_name [IF NOT EXISTS]
[ HANDLER handler_function | NO HANDLER ]
[ VALIDATOR validator_function | NO VALIDATOR ]
[ OPTIONS ( option 'value' [, ... ] ) ]
CREATE PROFILE profile_name
[IF NOT EXISTS]
USING profile_manager
SET key = value [, key = value]...
[USING profile_manager
SET key = value [, key = value]...]
[...];
CREATE PROFILE MAPPING
[IF NOT EXISTS]
FOR PROFILE profile_name
[MATCH [ ALL | ANY ] (
[ROLE role_name],
[BACKEND TYPE backend_type],
[DATABASE database_name],
[APPLICATION appname]
)];
## PROFILE RESOLUTION ORDER
1. ALTER ROLE IN DATABASE
2. ALTER ROLE
3. ALTER DATABASE
4. First matching PROFILE MAPPING (global or specific)
5. No profile (fallback)
As currently designed, this approach allows quite a lot of flexibility:
* pg_psi is used to ensure the spec is suitable for a cgroup profile
manager (moving PIDs as needed; NUMA and cgroups could work well
together, see e.g. this Linux kernel summary:
https://blogs.oracle.com/linux/post/numa-balancing )
* Someone else could implement support for Windows or BSD specifics.
* Others might use it to integrate PostgreSQL's own resources (e.g.,
"areas" of shared buffers) into policies.
Hope this perspective is helpful.
Best regards,
--
Cédric Villemain +33 6 20 30 22 52
https://www.Data-Bene.io
PostgreSQL Support, Expertise, Training, R&D
On 7/5/25 09:09, Cédric Villemain wrote:
Hi Tomas,
I haven't yet had time to fully read all the work and proposals around
NUMA and related features, but I hope to catch up over the summer.
However, I think it's important to share some thoughts before it's too
late, as you might find them relevant to the NUMA management code.
6) v1-0006-NUMA-pin-backends-to-NUMA-nodes.patch
This is an experimental patch, that simply pins the new process to the
NUMA node obtained from the freelist.
Driven by GUC "numa_procs_pin" (default: off).
In my work on more careful PostgreSQL resource management, I've come to
the conclusion that we should avoid pushing policy too deeply into the
PostgreSQL core itself. Therefore, I'm quite skeptical about integrating
NUMA-specific management directly into core PostgreSQL in such a way.
We are working on a PROFILE and PROFILE MANAGER specification to provide
PostgreSQL with only the APIs and hooks needed so that extensions can
manage whatever they want externally.
The basic syntax (not meant to be discussed here, and even the names
might change) is roughly as follows, just to illustrate the intent:
CREATE PROFILE MANAGER manager_name [IF NOT EXISTS]
[ HANDLER handler_function | NO HANDLER ]
[ VALIDATOR validator_function | NO VALIDATOR ]
[ OPTIONS ( option 'value' [, ... ] ) ]
CREATE PROFILE profile_name
[IF NOT EXISTS]
USING profile_manager
SET key = value [, key = value]...
[USING profile_manager
SET key = value [, key = value]...]
[...];
CREATE PROFILE MAPPING
[IF NOT EXISTS]
FOR PROFILE profile_name
[MATCH [ ALL | ANY ] (
[ROLE role_name],
[BACKEND TYPE backend_type],
[DATABASE database_name],
[APPLICATION appname]
)];
## PROFILE RESOLUTION ORDER
1. ALTER ROLE IN DATABASE
2. ALTER ROLE
3. ALTER DATABASE
4. First matching PROFILE MAPPING (global or specific)
5. No profile (fallback)
As currently designed, this approach allows quite a lot of flexibility:
* pg_psi is used to ensure the spec is suitable for a cgroup profile
manager (moving PIDs as needed; NUMA and cgroups could work well
together, see e.g. this Linux kernel summary:
https://blogs.oracle.com/linux/post/numa-balancing )
* Someone else could implement support for Windows or BSD specifics.
* Others might use it to integrate PostgreSQL's own resources (e.g.,
"areas" of shared buffers) into policies.
Hope this perspective is helpful.
Can you explain how you want to manage this by an extension defined at
the SQL level, when most of this stuff has to be done when setting up
shared memory, which is waaaay before we have any access to catalogs?
regards
--
Tomas Vondra
Hi Tomas, some more thoughts after the weekend:
On Fri, Jul 4, 2025 at 8:12 PM Tomas Vondra <tomas@vondra.me> wrote:
On 7/4/25 13:05, Jakub Wartak wrote:
On Tue, Jul 1, 2025 at 9:07 PM Tomas Vondra <tomas@vondra.me> wrote:
Hi!
1) v1-0001-NUMA-interleaving-buffers.patch
[..]
It's a bit more complicated, because the patch distributes both the
blocks and descriptors, in the same way. So a buffer and it's descriptor
always end on the same NUMA node. This is one of the reasons why we need
to map larger chunks, because NUMA works on page granularity, and the
descriptors are tiny - many fit on a memory page.
Oh, now I get it! OK, let's stick to this one.
I don't think the splitting would actually make some things simpler, or
maybe more flexible - in particular, it'd allow us to enable huge pages
only for some regions (like shared buffers), and keep the small pages
e.g. for PGPROC. So that'd be good.
You have made assumption that this is good, but small pages (4KB) are
not hugetlb, and are *swappable* (Transparent HP are swappable too,
manually allocated ones as with mmap(MMAP_HUGETLB) are not)[1]. The
most frequent problem I see these days are OOMs, and it makes me
believe that making certain critical parts of shared memory being
swappable just to make pagesize granular is possibly throwing the baby
out with the bathwater. I'm thinking about bad situations like: some
wrong settings of vm.swapiness that people keep (or distros keep?) and
general inability of PG to restrain from allocating more memory in
some cases.
I haven't observed such issues myself, or maybe I didn't realize it's
happening. Maybe it happens, but it'd be good to see some data showing
that, or a reproducer of some sort. But let's say it's real.
I don't think we should use huge pages merely to ensure something is not
swapped out. The "not swappable" is more of a limitation of huge pages,
not an advantage. You can't just choose to make them swappable.
Wouldn't it be better to keep using 4KB pages, but lock the memory using
mlock/mlockall?
In my book, not being swappable is a win (it's hard for me to imagine
when it could be beneficial to swap out parts of s_b).
I was trying to think about it and also came up with these:
Anyway, mlock() probably sounds like it, but e.g. Rocky 8.10 by default
has max locked memory (ulimit -l) as low as 64kB due to systemd's
DefaultLimitMEMLOCK, while Debian/Ubuntu have it at higher values.
I wasn't expecting that - those are bizarre low values. I think we would
need something like (10000*900)/1024/1024 or more, but with each
PGPROC on a separate page that would be way more?
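For illustration, the locking itself would be roughly this (a sketch, not
from any of the patches; how to report the failure is an open question):

#include <sys/mman.h>
#include <errno.h>
#include <stddef.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/*
 * Pin an already-mapped shared memory region into RAM with mlock(2), so
 * its 4kB pages cannot be swapped out. This is subject to RLIMIT_MEMLOCK
 * (ulimit -l), which is exactly the DefaultLimitMEMLOCK concern above.
 */
static bool
lock_region(void *addr, size_t len)
{
    if (mlock(addr, len) != 0)
    {
        /* typically ENOMEM/EPERM when exceeding RLIMIT_MEMLOCK */
        fprintf(stderr, "mlock failed: %s\n", strerror(errno));
        return false;
    }
    return true;
}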
Another thing with 4kB pages: there's this big assumption now that
once we arrive in InitProcess() we won't ever change NUMA node,
so we stick to the PGPROC from where we started (based on getcpu(2)).
Let's assume the CPU scheduler reassigned us to a different node, but we
have this 4kB page ready for PGPROC in theory, and this means we would
need to rely on the NUMA autobalancing doing its job to migrate that
4kB page from node to node (to get local accesses instead of
remote ones). The questions in my head are now like this:
- we initially asked for those PGPROC pages to be localized
on a certain node (they have a policy), so they won't autobalance? We
would need to call getcpu() again somewhere, notice the difference and
unlocalize (clear the NUMA/mbind() policy) the PGPROC page?
- mlock() as above says stick to the physical RAM page (?), so it won't move?
- after what time would the kernel's autobalancing migrate that page after
switching the active CPU<->node? I mean, do we execute enough reads on
this page?
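The "unlocalize" part might look roughly like this (a sketch, not from any
posted patch; mbind(2) with MPOL_DEFAULT clears the per-range policy so
NUMA balancing is free to migrate the page again):

#include <numaif.h>
#include <stddef.h>

/*
 * If getcpu()/sched_getcpu() tells us we now run on a different node than
 * the one our PGPROC page was bound to, drop the explicit placement policy
 * on that page.
 */
static void
maybe_unbind_pgproc_page(void *pgproc_page, size_t page_size,
                         int assigned_node, int current_node)
{
    if (current_node != assigned_node)
        (void) mbind(pgproc_page, page_size, MPOL_DEFAULT, NULL, 0, 0);
}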
BTW: to make this pragmatic, what's the most trivial one-liner
way to exercise/stress PGPROC?
The other thing I haven't thought about very much is determining on
which CPUs/nodes the instance is allowed to run. I assume we'd start by
simply inherit/determine that at the start through libnuma, not through
some custom PG configuration (which the patch [2] proposed to do).
0. I think that we could do better, some counter arguments to
no-configuration-at-all:
a. as Robert & Bertrand already put it there after review: let's say I
want just to run on NUMA #2 node, so here I would need to override
systemd's script ExecStart= to include that numactl (not elegant?). I
could also use `CPUAffinity=1,3,5,7..` but that's all, and it is even
less friendly. Also it probably requires root to edit/reload systemd,
while having GUC for this like in my proposal makes it more smooth (I
think?)
b. wouldn't it be better if that stayed as drop-in rather than always
on? What if there's a problem, how do you disable those internal
optimizations if they do harm in some cases? (or let's say I want to
play with MPOL_INTERLEAVE_WEIGHTED?). So at least boolean
numa_buffers_interleave would be nice?
c. What if I want my standby (walreceiver+startup/recovery) to run
with NUMA affinity to get better performance (I'm not going to hack
around systemd script every time, but I could imagine changing
numa=X,Y,Z after restart/before promotion)
d. Now if I would be forced for some reason to do that numactl(1)
voodoo, and use the those above mentioned overrides and PG wouldn't be
having GUC (let's say I would use `numactl
--weighted-interleave=0,1`), then:
I'm not against doing something like this, but I don't plan to do that
in V1. I don't have a clear idea what configurability is actually
needed, so it's likely I'd do the interface wrong.
2) v1-0002-NUMA-localalloc.patch
This simply sets "localalloc" when initializing a backend, so that all
memory allocated later is local, not interleaved. Initially this was
necessary because the patch set the allocation policy to interleaving
before initializing shared memory, and we didn't want to interleave the
private memory. But that's no longer the case - the explicit mapping to
nodes does not have this issue. I'm keeping the patch for convenience,
it allows experimenting with numactl etc.
... is not accurate anymore and we would require to have that in
(still with GUC)?
Thoughts? I can add that mine part into Your's patches if you want.
I'm sorry, I don't understand what's the question :-(
That patch reference above was a chain of thought from step "d".
What I had in mind was that you cannot remove the patch
`v1-0002-NUMA-localalloc.patch` from the scope if you force people to
use numactl by not having enough configurability on the PG side. That
is: if someone has to use systemd + numactl
--interleave/--weighted-interleave, then they will also need a
way to use numa_localalloc=on (to override the new/user's policy
default, otherwise local memory allocations are also going to be
interleaved, and we are back to square one). Which brings me to the
point of why we should include the configuration properly from the
start instead of this toggle (it's not that hard, apparently).
Way too quick review and some very fast benchmark probes, I've
concentrated only on v1-0001 and v1-0005 (efficiency of buffermgmt
would be too new topic for me), but let's start:
1. normal pgbench -S (still with just s_b@4GB), done many tries,
consistent benefit for the patch with like +8..10% boost on generic
run:
[.. removed numbers]
But this actually brings an interesting question. What exactly should we
expect / demand from these patches? In my mind it'd primarily about
predictability and stability of results.
For example, the results should not depend on how was the database
warmed up - was it done by a single backend or many backends? Was it
restarted, or what? I could probably warmup the system very carefully to
ensure it's balanced. The patches mean I don't need to be that careful.
Well, pretty much the same here. I was after minimizing "stddev" (to
have better predictability of results, especially across restarts) and
increasing available bandwidth [which is pretty much related]. Without
our NUMA work, PG can just put that s_b on any random node, or spill it
randomly from one node to another (depending on the size of the
allocation request).
So should I close https://commitfest.postgresql.org/patch/5703/
and you'll open a new one or should I just edit the #5703 and alter it
and add this thread too?
Good question. It's probably best to close the original entry as
"withdrawn" and I'll add a new entry. Sounds OK?
Sure thing, marked it as `Returned with feedback`, this approach seems
to be much more advanced.
3. Patch is not calling interleave on PQ shmem, do we want to add that
in as some next item like v1-0007? Question is whether OS interleaving
makes sense there ? I believe it does there, please see my thread
(NUMA_pq_cpu_pinning_results.txt), the issue is that PQ workers are
being spawned by postmaster and may end up on different NUMA nodes
randomly, so actually OS-interleaving that memory reduces jitter there
(AKA bandwidth-over-latency). My thinking is that one cannot expect
static/forced CPU-to-just-one-NUMA-node assignment for backend and
it's PQ workers, because it is impossible have always available CPU
power there in that NUMA node, so it might be useful to interleave
that shared mem there too (as separate patch item?)
Excellent question. I haven't thought about this at all. I agree it
probably makes sense to interleave this memory, in some way. I don't
know what's the perfect scheme, though.
wild idea: Would it make sense to pin the workers to the same NUMA node
as the leader? And allocate all memory only from that node?
I'm trying to convey exactly the opposite message, or at least that it
might depend on configuration. Please see
/messages/by-id/CAKZiRmxYMPbQ4WiyJWh=Vuw_Ny+hLGH9_9FaacKRJvzZ-smm+w@mail.gmail.com
(btw it should read there that I don't intend to spend a lot of time on
PQ), but anyway: I think we should NOT pin the PQ workers to the same
NODE, as you do not know if there's CPU left there (same story as with
v1-0006 here).
I'm just proposing quick OS-based interleaving of PQ shm if using all
nodes, literally:
@@ -334,6 +336,13 @@ dsm_impl_posix(dsm_op op, dsm_handle handle, Size request_size,
     }
     *mapped_address = address;
     *mapped_size = request_size;
+
+    /* We interleave memory only at creation time. */
+    if (op == DSM_OP_CREATE && numa->setting > NUMA_OFF) {
+        elog(DEBUG1, "interleaving shm mem @ %p size=%zu",
+             *mapped_address, *mapped_size);
+        pg_numa_interleave_memptr(*mapped_address, *mapped_size, numa->nodes);
+    }
+
Because then, if memory is interleaved, you probably have less variance
in memory access. But also from that previous thread:
"So if anything:
- latency-wise: it would be best to place leader+all PQ workers close
to s_b, provided s_b fits NUMA shared/huge page memory there and you
won't need more CPU than there's on that NUMA node... (assuming e.g.
hosting 4 DBs on 4-sockets each on it's own, it would be best to pin
everything including shm, but PQ workers too)
- capacity/TPS-wise or s_b > NUMA: just interleave to maximize
bandwidth and get uniform CPU performance out of this"
So the wild idea was: maybe PQ shm interleaving should depend on the NUMA
configuration (if interleaving to all nodes, then interleave normally,
but if the configuration is set to just 1 NUMA node, it automatically binds
there -- there was '@' support for that in my patch).
4 In BufferManagerShmemInit() you call numa_num_configured_nodes()
(also in v1-0005). My worry is should we may put some
known-limitations docs (?) from start and mention that
if the VM is greatly resized and NUMA numa nodes appear, they might
not be used until restart?
Yes, this is one thing I need some feedback on. The patches mostly
assume there are no disabled nodes, that the set of allowed nodes does
not change, etc. I think for V1 that's a reasonable limitation.
Sure!
But let's say we want to relax this a bit. How do we learn about the
change, after a node/CPU gets disabled? For some parts it's not that
difficult (e.g. we can "remap" buffers/descriptors) in the background.
But for other parts that's not practical. E.g. we can't rework how the
PGPROC gets split.
But while discussing this with Andres yesterday, he had an interesting
suggestion - to always use e.g. 8 or 16 partitions, then partition this
by NUMA node. So we'd have 16 partitions, and with 4 nodes the 0-3 would
go to node 0, 4-7 would go to node 1, etc. The advantage is that if a
node gets disabled, we can rebuild just this small "mapping" and not the
16 partitions. And the partitioning may be helpful even without NUMA.
Still have to figure out the details, but seems it might help.
Right, no idea how the shared_memory remapping patch will work
(how/when the s_b change will be executed), but we could somehow mark
that the number of NUMA zones should be rechecked during SIGHUP (?) and
then just do a simple comparison check whether old_numa_num_configured_nodes
== new_numa_num_configured_nodes is true.
Anyway, I think it's way too advanced for now, don't you think? (like
CPU ballooning [s_b itself] is rare, and NUMA ballooning seems to be
super-wild-rare)
As for the rest, I forgot to include this too: getcpu() - this really
needs a portable pg_getcpu() wrapper.
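Something like this, perhaps (a sketch only - the HAVE_SCHED_GETCPU
configure check is hypothetical, it doesn't exist today; a real version
would probably also want the NUMA node, e.g. via getcpu(2) or
numa_node_of_cpu()):

#ifdef HAVE_SCHED_GETCPU
#include <sched.h>
#endif

/* Return the CPU the calling process is currently running on, or -1. */
static int
pg_getcpu(void)
{
#ifdef HAVE_SCHED_GETCPU
    return sched_getcpu();      /* may still return -1 on failure */
#else
    return -1;                  /* not supported on this platform */
#endif
}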
-J.
* Others might use it to integrate PostgreSQL's own resources (e.g.,
"areas" of shared buffers) into policies.Hope this perspective is helpful.
Can you explain how you want to manage this by an extension defined at
the SQL level, when most of this stuff has to be done when setting up
shared memory, which is waaaay before we have any access to catalogs?
I should have said module instead; I didn't follow carefully, but at some
point there was discussion about shared buffers being resized "on-line".
Anyway, it was just to give a few examples, maybe this one is to be
considered later (I'm focused on cgroup/psi, and precisely reassigning
PIDs as needed).
--
Cédric Villemain +33 6 20 30 22 52
https://www.Data-Bene.io
PostgreSQL Support, Expertise, Training, R&D
On 7/7/25 16:51, Cédric Villemain wrote:
* Others might use it to integrate PostgreSQL's own resources (e.g.,
"areas" of shared buffers) into policies.Hope this perspective is helpful.
Can you explain how you want to manage this by an extension defined at
the SQL level, when most of this stuff has to be done when setting up
shared memory, which is waaaay before we have any access to catalogs?
I should have said module instead, I didn't follow carefully but at some
point there were discussion about shared buffers resized "on-line".
Anyway, it was just to give some few examples, maybe this one is to be
considered later (I'm focused on cgroup/psi, and precisely reassigning
PIDs as needed).
I don't know. I have a hard time imagining what exactly the
policies / profiles would do to respond to changes in the system
utilization. And why should that interfere with this patch ...
The main thing the patch series aims to implement is partitioning different
pieces of shared memory (buffers, freelists, ...) to work better for
NUMA. I don't think there are that many ways to do this, and I doubt it
makes sense to make this easily customizable from external modules of
any kind. I can imagine providing some API allowing the instance to be
isolated on selected NUMA nodes, but that's about it.
Yes, there's some relation to the online resizing of shared buffers, in
which case we need to "refresh" some of the information. But AFAICS it's
not very extensive (on top of what already needs to happen after the
resize), and it'd happen within the boundaries of the partitioning
scheme. There's not that much flexibility.
The last bit (pinning backends to a NUMA node) is experimental, and
mostly intended for easier evaluation of the earlier parts (e.g. to
limit the noise when processes get moved to a CPU from a different NUMA
node, and so on).
regards
--
Tomas Vondra
Hi,
On 2025-07-05 07:09:00 +0000, Cédric Villemain wrote:
In my work on more careful PostgreSQL resource management, I've come to the
conclusion that we should avoid pushing policy too deeply into the
PostgreSQL core itself. Therefore, I'm quite skeptical about integrating
NUMA-specific management directly into core PostgreSQL in such a way.
I think it's actually the opposite - whenever we pushed stuff like this
outside of core it has hurt postgres substantially. Not having replication in
core was a huge mistake. Not having HA management in core is probably the
biggest current adoption hurdle for postgres.
To deal better with NUMA we need to improve memory placement and various
algorithms, in an interrelated way - that's pretty much impossible to do
outside of core.
Greetings,
Andres Freund
On 7/7/25 16:51, Cédric Villemain wrote:
* Others might use it to integrate PostgreSQL's own resources (e.g.,
"areas" of shared buffers) into policies.Hope this perspective is helpful.
Can you explain how you want to manage this by an extension defined at
the SQL level, when most of this stuff has to be done when setting up
shared memory, which is waaaay before we have any access to catalogs?
I should have said module instead, I didn't follow carefully but at some
point there were discussion about shared buffers resized "on-line".
Anyway, it was just to give some few examples, maybe this one is to be
considered later (I'm focused on cgroup/psi, and precisely reassigning
PIDs as needed).
I don't know. I have a hard time imagining what exactly would the
policies / profiles do exactly to respond to changes in the system
utilization. And why should that interfere with this patch ...
The main thing patch series aims to implement is partitioning different
pieces of shared memory (buffers, freelists, ...) to better work for
NUMA. I don't think there's that many ways to do this, and I doubt it
makes sense to make this easily customizable from external modules of
any kind. I can imagine providing some API allowing to isolate the
instance on selected NUMA nodes, but that's about it.
Yes, there's some relation to the online resizing of shared buffers, in
which case we need to "refresh" some of the information. But AFAICS it's
not very extensive (on top of what already needs to happen after the
resize), and it'd happen within the boundaries of the partitioning
scheme. There's not that much flexibility.
The last bit (pinning backends to a NUMA node) is experimental, and
mostly intended for easier evaluation of the earlier parts (e.g. to
limit the noise when processes get moved to a CPU from a different NUMA
node, and so on).
Perhaps the backend pinning could be done by replacing your patch on proc.c
with a call to an external profile manager doing exactly the same thing?
Similar to:
pmroutine = GetPmRoutineForInitProcess();
if (pmroutine != NULL &&
pmroutine->init_process != NULL)
pmroutine->init_process(MyProc);
...
pmroutine = GetPmRoutineForInitAuxilliary();
if (pmroutine != NULL &&
pmroutine->init_auxilliary != NULL)
pmroutine->init_auxilliary(MyProc);
Added in a few select places, this should cover most if not all of the
requirements around process placement (process_shared_preload_libraries()
is called earlier in process creation, I believe).
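For context, the hook structure behind those calls might look roughly like
this (purely a sketch; the names follow the snippet above and are not from
any posted patch):

typedef struct PmRoutine
{
    void        (*init_process) (PGPROC *proc);     /* regular backends */
    void        (*init_auxilliary) (PGPROC *proc);  /* auxiliary processes */
} PmRoutine;

extern PmRoutine *GetPmRoutineForInitProcess(void);
extern PmRoutine *GetPmRoutineForInitAuxilliary(void);

A "profile manager" module would fill this in at load time, and core would
only consult it at the few process-start points mentioned above.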
--
Cédric Villemain +33 6 20 30 22 52
https://www.Data-Bene.io
PostgreSQL Support, Expertise, Training, R&D
Hi Andres,
Hi,
On 2025-07-05 07:09:00 +0000, Cédric Villemain wrote:
In my work on more careful PostgreSQL resource management, I've come to the
conclusion that we should avoid pushing policy too deeply into the
PostgreSQL core itself. Therefore, I'm quite skeptical about integrating
NUMA-specific management directly into core PostgreSQL in such a way.
I think it's actually the opposite - whenever we pushed stuff like this
outside of core it has hurt postgres substantially. Not having replication in
core was a huge mistake. Not having HA management in core is probably the
biggest current adoption hurdle for postgres.
To deal better with NUMA we need to improve memory placement and various
algorithms, in an interrelated way - that's pretty much impossible to do
outside of core.
Except the backend pinning which is easy to achieve, thus my comment on
the related patch.
I'm not claiming NUMA memory and all should be managed outside of core
(though I didn't read other patches yet).
--
Cédric Villemain +33 6 20 30 22 52
https://www.Data-Bene.io
PostgreSQL Support, Expertise, Training, R&D
On 7/7/25 16:51, Cédric Villemain wrote:
* Others might use it to integrate PostgreSQL's own resources (e.g.,
"areas" of shared buffers) into policies.Hope this perspective is helpful.
Can you explain how you want to manage this by an extension defined at
the SQL level, when most of this stuff has to be done when setting up
shared memory, which is waaaay before we have any access to catalogs?
I should have said module instead, I didn't follow carefully but at some
point there were discussion about shared buffers resized "on-line".
Anyway, it was just to give some few examples, maybe this one is to be
considered later (I'm focused on cgroup/psi, and precisely reassigning
PIDs as needed).
I don't know. I have a hard time imagining what exactly would the
policies / profiles do exactly to respond to changes in the system
utilization. And why should that interfere with this patch ...
The main thing patch series aims to implement is partitioning different
pieces of shared memory (buffers, freelists, ...) to better work for
NUMA. I don't think there's that many ways to do this, and I doubt it
makes sense to make this easily customizable from external modules of
any kind. I can imagine providing some API allowing to isolate the
instance on selected NUMA nodes, but that's about it.
Yes, there's some relation to the online resizing of shared buffers, in
which case we need to "refresh" some of the information. But AFAICS it's
not very extensive (on top of what already needs to happen after the
resize), and it'd happen within the boundaries of the partitioning
scheme. There's not that much flexibility.
The last bit (pinning backends to a NUMA node) is experimental, and
mostly intended for easier evaluation of the earlier parts (e.g. to
limit the noise when processes get moved to a CPU from a different NUMA
node, and so on).
The backend pinning can be done by replacing your patch on proc.c to
call an external profile manager doing exactly the same thing maybe ?
Similar to:
pmroutine = GetPmRoutineForInitProcess();
if (pmroutine != NULL &&
pmroutine->init_process != NULL)
pmroutine->init_process(MyProc);...
pmroutine = GetPmRoutineForInitAuxilliary();
if (pmroutine != NULL &&
pmroutine->init_auxilliary != NULL)
pmroutine->init_auxilliary(MyProc);Added on some rare places should cover most if not all the requirement
around process placement (process_shared_preload_libraries() is called
earlier in the process creation I believe).
After a first read I think this works for patches 002 and 005. For the
last one, InitProcGlobal() may set things up as you do, but then expose
the choice a bit later, basically in the places where you added the if
condition on the GUC numa_procs_interleave.
--
Cédric Villemain +33 6 20 30 22 52
https://www.Data-Bene.io
PostgreSQL Support, Expertise, Training, R&D
Hi,
On 2025-07-04 13:05:05 +0200, Jakub Wartak wrote:
On Tue, Jul 1, 2025 at 9:07 PM Tomas Vondra <tomas@vondra.me> wrote:
I don't think the splitting would actually make some things simpler, or
maybe more flexible - in particular, it'd allow us to enable huge pages
only for some regions (like shared buffers), and keep the small pages
e.g. for PGPROC. So that'd be good.
You have made assumption that this is good, but small pages (4KB) are
not hugetlb, and are *swappable* (Transparent HP are swappable too,
manually allocated ones as with mmap(MMAP_HUGETLB) are not)[1]. The
most frequent problem I see these days are OOMs, and it makes me
believe that making certain critical parts of shared memory being
swappable just to make pagesize granular is possibly throwing the baby
out with the bathwater. I'm thinking about bad situations like: some
wrong settings of vm.swapiness that people keep (or distros keep?) and
general inability of PG to restrain from allocating more memory in
some cases.
The reason it would be advantageous to put something like the procarray onto
smaller pages is that otherwise the entire procarray (unless particularly
large) ends up on a single NUMA node, increasing the latency for backends on
every other numa node and increasing memory traffic on that node.
Greetings,
Andres Freund
On 7/8/25 05:04, Andres Freund wrote:
Hi,
On 2025-07-04 13:05:05 +0200, Jakub Wartak wrote:
On Tue, Jul 1, 2025 at 9:07 PM Tomas Vondra <tomas@vondra.me> wrote:
I don't think the splitting would actually make some things simpler, or
maybe more flexible - in particular, it'd allow us to enable huge pages
only for some regions (like shared buffers), and keep the small pages
e.g. for PGPROC. So that'd be good.
You have made assumption that this is good, but small pages (4KB) are
not hugetlb, and are *swappable* (Transparent HP are swappable too,
manually allocated ones as with mmap(MMAP_HUGETLB) are not)[1]. The
most frequent problem I see these days are OOMs, and it makes me
believe that making certain critical parts of shared memory being
swappable just to make pagesize granular is possibly throwing the baby
out with the bathwater. I'm thinking about bad situations like: some
wrong settings of vm.swapiness that people keep (or distros keep?) and
general inability of PG to restrain from allocating more memory in
some cases.
The reason it would be advantageous to put something like the procarray onto
smaller pages is that otherwise the entire procarray (unless particularly
large) ends up on a single NUMA node, increasing the latency for backends on
every other numa node and increasing memory traffic on that node.
That's why the patch series splits the procarray into multiple pieces,
so that it can be properly distributed on multiple NUMA nodes even with
huge pages. It requires adjusting a couple places accessing the entries,
but it surprised me how limited the impact was.
If we could selectively use 4KB pages for parts of the shared memory,
maybe this wouldn't be necessary. But it's not too annoying.
The thing I'm not sure about is how much this actually helps with the
traffic between nodes. Sure, if we pick a PGPROC from the same node, and
the task does not get moved, it'll be local traffic. But if the task
moves, there'll be traffic. I don't have any estimates for how often this
happens, e.g. for older tasks.
regards
--
Tomas Vondra
On 7/8/25 03:55, Cédric Villemain wrote:
Hi Andres,
Hi,
On 2025-07-05 07:09:00 +0000, Cédric Villemain wrote:
In my work on more careful PostgreSQL resource management, I've come
to the
conclusion that we should avoid pushing policy too deeply into the
PostgreSQL core itself. Therefore, I'm quite skeptical about integrating
NUMA-specific management directly into core PostgreSQL in such a way.
I think it's actually the opposite - whenever we pushed stuff like this
outside of core it has hurt postgres substantially. Not having replication in
core was a huge mistake. Not having HA management in core is probably the
biggest current adoption hurdle for postgres.
To deal better with NUMA we need to improve memory placement and various
algorithms, in an interrelated way - that's pretty much impossible to do
outside of core.
Except the backend pinning which is easy to achieve, thus my comment on
the related patch.
I'm not claiming NUMA memory and all should be managed outside of core
(though I didn't read other patches yet).
But an "optimal backend placement" seems to very much depend on where we
placed the various pieces of shared memory. Which the external module
will have trouble following, I suspect.
I still don't have any idea what exactly the external module would do,
or how it would decide where to place the backend. Can you describe some
use case with an example?
Assuming we want to actually pin tasks from within Postgres, what I
think might work is allowing modules to "advise" on where to place the
task. But the decision would still be done by core.
regards
--
Tomas Vondra
Hi,
On 2025-07-08 14:27:12 +0200, Tomas Vondra wrote:
On 7/8/25 05:04, Andres Freund wrote:
On 2025-07-04 13:05:05 +0200, Jakub Wartak wrote:
The reason it would be advantageous to put something like the procarray onto
smaller pages is that otherwise the entire procarray (unless particularly
large) ends up on a single NUMA node, increasing the latency for backends on
every other numa node and increasing memory traffic on that node.
That's why the patch series splits the procarray into multiple pieces,
so that it can be properly distributed on multiple NUMA nodes even with
huge pages. It requires adjusting a couple places accessing the entries,
but it surprised me how limited the impact was.
Sure, you can do that, but it does mean that iterations over the procarray now
have an added level of indirection...
The thing I'm not sure about is how much this actually helps with the
traffic between node. Sure, if we pick a PGPROC from the same node, and
the task does not get moved, it'll be local traffic. But if the task
moves, there'll be traffic. I don't have any estimates how often this
happens, e.g. for older tasks.
I think the most important bit is to not put everything onto one numa node,
otherwise the chance of increased latency for *everyone* due to the increased
memory contention is more likely to hurt.
Greetings,
Andres Freund
On 7/8/25 03:55, Cédric Villemain wrote:
Hi Andres,
Hi,
On 2025-07-05 07:09:00 +0000, Cédric Villemain wrote:
In my work on more careful PostgreSQL resource management, I've come
to the
conclusion that we should avoid pushing policy too deeply into the
PostgreSQL core itself. Therefore, I'm quite skeptical about integrating
NUMA-specific management directly into core PostgreSQL in such a way.
I think it's actually the opposite - whenever we pushed stuff like this
outside of core it has hurt postgres substantially. Not having replication in
core was a huge mistake. Not having HA management in core is probably the
biggest current adoption hurdle for postgres.
To deal better with NUMA we need to improve memory placement and various
algorithms, in an interrelated way - that's pretty much impossible to do
outside of core.
Except the backend pinning which is easy to achieve, thus my comment on
the related patch.
I'm not claiming NUMA memory and all should be managed outside of core
(though I didn't read other patches yet).
But an "optimal backend placement" seems to very much depend on where we
placed the various pieces of shared memory. Which the external module
will have trouble following, I suspect.
I still don't have any idea what exactly would the external module do,
how would it decide where to place the backend. Can you describe some
use case with an example?
Assuming we want to actually pin tasks from within Postgres, what I
think might work is allowing modules to "advise" on where to place the
task. But the decision would still be done by core.
Possibly exactly what you're doing in proc.c when managing the allocation of
processes, but not hardcoded in PostgreSQL (patches 02, 05 and 06 are good
candidates); I didn't get the impression that they require information not
available to any process executing code from a module.
Parts of your code where you assign/define policy could be in one or
more relevant routines of a "numa profile manager", like in an
initProcessRoutine(), and registered in pmroutine struct:
pmroutine = GetPmRoutineForInitProcess();
if (pmroutine != NULL &&
pmroutine->init_process != NULL)
pmroutine->init_process(MyProc);
This way it's easier to manage alternative policies, and also to
adjust when the hardware and Linux kernel change.
--
Cédric Villemain +33 6 20 30 22 52
https://www.Data-Bene.io
PostgreSQL Support, Expertise, Training, R&D
On 7/8/25 18:06, Cédric Villemain wrote:
On 7/8/25 03:55, Cédric Villemain wrote:
Hi Andres,
Hi,
On 2025-07-05 07:09:00 +0000, Cédric Villemain wrote:
In my work on more careful PostgreSQL resource management, I've come
to the
conclusion that we should avoid pushing policy too deeply into the
PostgreSQL core itself. Therefore, I'm quite skeptical about
integrating
NUMA-specific management directly into core PostgreSQL in such a way.
I think it's actually the opposite - whenever we pushed stuff like this
outside of core it has hurt postgres substantially. Not having
replication in core was a huge mistake. Not having HA management in core is
probably the biggest current adoption hurdle for postgres.
To deal better with NUMA we need to improve memory placement and
various algorithms, in an interrelated way - that's pretty much impossible
to do outside of core.
Except the backend pinning which is easy to achieve, thus my comment on
the related patch.
I'm not claiming NUMA memory and all should be managed outside of core
(though I didn't read other patches yet).
But an "optimal backend placement" seems to very much depend on where we
placed the various pieces of shared memory. Which the external module
will have trouble following, I suspect.
I still don't have any idea what exactly would the external module do,
how would it decide where to place the backend. Can you describe some
use case with an example?
Assuming we want to actually pin tasks from within Postgres, what I
think might work is allowing modules to "advise" on where to place the
task. But the decision would still be done by core.
Possibly exactly what you're doing in proc.c when managing allocation of
process, but not hardcoded in postgresql (patches 02, 05 and 06 are good
candidates), I didn't get that they require information not available to
any process executing code from a module.
Well, it needs to understand how some other stuff (especially PGPROC
entries) is distributed between nodes. I'm not sure how much of this
internal information we want to expose outside core ...
Parts of your code where you assign/define policy could be in one or
more relevant routines of a "numa profile manager", like in an
initProcessRoutine(), and registered in pmroutine struct:
pmroutine = GetPmRoutineForInitProcess();
if (pmroutine != NULL &&
pmroutine->init_process != NULL)
pmroutine->init_process(MyProc);
This way it's easier to manage alternative policies, and also to be able
to adjust when hardware and linux kernel changes.
I'm not against making this extensible, in some way. But I still
struggle to imagine a reasonable alternative policy, where the external
module gets the same information and ends up with a different decision.
So what would the alternate policy look like? What use case would the
module be supporting?
regards
--
Tomas Vondra
On 7/8/25 18:06, Cédric Villemain wrote:
On 7/8/25 03:55, Cédric Villemain wrote:
Hi Andres,
Hi,
On 2025-07-05 07:09:00 +0000, Cédric Villemain wrote:
In my work on more careful PostgreSQL resource management, I've come
to the
conclusion that we should avoid pushing policy too deeply into the
PostgreSQL core itself. Therefore, I'm quite skeptical about
integrating
NUMA-specific management directly into core PostgreSQL in such a way.
I think it's actually the opposite - whenever we pushed stuff like this
outside of core it has hurt postgres substantially. Not having
replication in core was a huge mistake. Not having HA management in core is
probably the biggest current adoption hurdle for postgres.
To deal better with NUMA we need to improve memory placement and
various algorithms, in an interrelated way - that's pretty much impossible
to do outside of core.
Except the backend pinning which is easy to achieve, thus my comment on
the related patch.
I'm not claiming NUMA memory and all should be managed outside of core
(though I didn't read other patches yet).

But an "optimal backend placement" seems to very much depend on where we
placed the various pieces of shared memory. Which the external module
will have trouble following, I suspect.

I still don't have any idea what exactly would the external module do,
how would it decide where to place the backend. Can you describe some
use case with an example?

Assuming we want to actually pin tasks from within Postgres, what I
think might work is allowing modules to "advise" on where to place the
task. But the decision would still be done by core.

Possibly exactly what you're doing in proc.c when managing allocation of
process, but not hardcoded in postgresql (patches 02, 05 and 06 are good
candidates), I didn't get that they require information not available to
any process executing code from a module.

Well, it needs to understand how some other stuff (especially PGPROC
entries) is distributed between nodes. I'm not sure how much of this
internal information we want to expose outside core ...

Parts of your code where you assign/define policy could be in one or
more relevant routines of a "numa profile manager", like in an
initProcessRoutine(), and registered in pmroutine struct:

    pmroutine = GetPmRoutineForInitProcess();
    if (pmroutine != NULL &&
        pmroutine->init_process != NULL)
        pmroutine->init_process(MyProc);

This way it's easier to manage alternative policies, and also to be able
to adjust when hardware and linux kernel changes.

I'm not against making this extensible, in some way. But I still
struggle to imagine a reasonable alternative policy, where the external
module gets the same information and ends up with a different decision.

So what would the alternate policy look like? What use case would the
module be supporting?
That's the whole point: there are very distinct usages of PostgreSQL in
the field. And maybe not all of them will require the policy defined by
PostgreSQL core.
May I ask the reverse: what prevents external modules from taking those
decisions? There are already a lot of areas where external code can take
over PostgreSQL processing, like Neon is doing.
There is some very early processing for memory setup that I can see as
a current blocker, and here I'd refer to a more compliant NUMA API, as
proposed by Jakub, so it's possible to arrange things based on workload,
hardware configuration or other matters. Reworking to get distinct
segments and so on, as you do, is great, and a combination of both
approaches is probably of great interest. There is also the weighted
interleave being discussed, and probably much more to come in this area
in Linux.
I think some points about possible distinct policies have been raised
already. I am precisely claiming that it is hard to come up with one
good policy with limited setup options, thus the requirement to keep
this flexible enough (hooks, an API, 100 GUCs?).
There is an EPYC story here also: given that the NUMA setup can vary
depending on BIOS settings, the associated NUMA policy must probably take
that into account (L3 can be either a real cache or 4 extra "local" NUMA
nodes - with highly distinct access costs from a RAM module).
Does that change how PostgreSQL will place memory and processes? Is it
important or of interest?
--
Cédric Villemain +33 6 20 30 22 52
https://www.Data-Bene.io
PostgreSQL Support, Expertise, Training, R&D
Hi,
On Wed, Jul 09, 2025 at 06:40:00AM +0000, Cédric Villemain wrote:
On 7/8/25 18:06, Cédric Villemain wrote:
I'm not against making this extensible, in some way. But I still
struggle to imagine a reasonable alternative policy, where the external
module gets the same information and ends up with a different decision.

So what would the alternate policy look like? What use case would the
module be supporting?

That's the whole point: there are very distinct usages of PostgreSQL in the
field. And maybe not all of them will require the policy defined by
PostgreSQL core.

May I ask the reverse: what prevents external modules from taking those
decisions ? There are already a lot of area where external code can take
over PostgreSQL processing, like Neon is doing.

There are some very early processing for memory setup that I can see as a
current blocker, and here I'd refer a more compliant NUMA api as proposed by
Jakub so it's possible to arrange based on workload, hardware configuration
or other matters. Reworking to get distinct segment and all as you do is
great, and combo of both approach probably of great interest.
I think that Tomas's approach helps to have more "predictable" performance
expectations, I mean more consistent over time, fewer "surprises".
While your approach (and Jakub's) could help to get performance gains
depending on a "known" context (so less generic).
So, probably having both could make sense but I think that they serve different
purposes.
Regards,
--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
On Tue, Jul 8, 2025 at 2:56 PM Andres Freund <andres@anarazel.de> wrote:
Hi,
On 2025-07-08 14:27:12 +0200, Tomas Vondra wrote:
On 7/8/25 05:04, Andres Freund wrote:
On 2025-07-04 13:05:05 +0200, Jakub Wartak wrote:
The reason it would be advantageous to put something like the procarray onto
smaller pages is that otherwise the entire procarray (unless particularly
large) ends up on a single NUMA node, increasing the latency for backends on
every other numa node and increasing memory traffic on that node.
Sure thing, I fully understand the motivation and underlying reason
(without claiming that I understand the exact memory access patterns
that involve procarray/PGPROC/etc and hotspots involved from PG side).
Any single-liner pgbench help for how to really easily stress the
PGPROC or procarray?
That's why the patch series splits the procarray into multiple pieces,
so that it can be properly distributed on multiple NUMA nodes even with
huge pages. It requires adjusting a couple places accessing the entries,
but it surprised me how limited the impact was.
Yes, and we are discussing if it is worth getting into smaller pages
for such use cases (e.g. 4kB ones without hugetlb, with 2MB huge pages,
or - even more wasteful - 1GB hugetlb if we don't request 2MB for some
small structs; btw, we have the ability to select MAP_HUGE_2MB vs
MAP_HUGE_1GB). I'm thinking about two problems:
- 4kB are swappable and mlock() potentially (?) disarms NUMA autobalancing
- using libnuma often leads to MPOL_BIND which disarms NUMA
autobalancing, BUT apparently there are set_mempolicy(2)/mbind(2) and
since 5.12+ kernel they can take additional flag
MPOL_F_NUMA_BALANCING(!), so this looks like it has potential to move
memory anyway (if way too many tasks are relocated, so would be
memory?). It is available only in recent libnuma as
numa_set_membind_balancing(3), but sadly there's no way via libnuma to
do mbind(MPOL_F_NUMA_BALANCING) for a specific addr only? I mean it
would have be something like MPOL_F_NUMA_BALANCING | MPOL_PREFERRED?
(select one node from many for each node while still allowing
balancing?), but in [1][2] (2024) it is stated that "It's not
legitimate (yet) to use MPOL_PREFERRED + MPOL_F_NUMA_BALANCING.", but
maybe stuff has been improved since then.
Something like:
PGPROC/procarray 2MB page for node#1 - mbind(addr1,
MPOL_F_NUMA_BALANCING | MPOL_PREFERRED, [0,1]);
PGPROC/procarray 2MB page for node#2 - mbind(addr2,
MPOL_F_NUMA_BALANCING | MPOL_PREFERRED, [1,0]);
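For illustration only, a rough sketch of what such a per-chunk call could
look like (assuming a <numaif.h> new enough to define
MPOL_F_NUMA_BALANCING; bind_chunk_with_balancing() is a made-up helper,
and since the MPOL_PREFERRED combination is reportedly not accepted yet,
this sticks to MPOL_BIND):

    #include <numaif.h>         /* mbind(), MPOL_* */
    #include <errno.h>
    #include <stdio.h>
    #include <string.h>

    /*
     * Hypothetical helper: bind one chunk (e.g. a 2MB page holding PGPROC
     * entries) to a single node while asking the kernel to keep NUMA
     * balancing enabled. Needs a 5.12+ kernel; may fail with EINVAL on
     * kernels that don't accept this mode/flag combination.
     */
    static int
    bind_chunk_with_balancing(void *addr, size_t len, int node)
    {
        unsigned long nodemask = 1UL << node;

        if (mbind(addr, len, MPOL_BIND | MPOL_F_NUMA_BALANCING,
                  &nodemask, sizeof(nodemask) * 8, 0) != 0)
        {
            fprintf(stderr, "mbind failed: %s\n", strerror(errno));
            return -1;
        }
        return 0;
    }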
Sure, you can do that, but it does mean that iterations over the procarray now
have an added level of indirection...
So the most efficient would be the old-way (no indirections) vs
NUMA-way? Can this be done without #ifdefs at all?
The thing I'm not sure about is how much this actually helps with the
traffic between node. Sure, if we pick a PGPROC from the same node, and
the task does not get moved, it'll be local traffic. But if the task
moves, there'll be traffic.
With MPOL_F_NUMA_BALANCING, that should "auto-tune" in the worst case?
I don't have any estimates how often this happens, e.g. for older tasks.
We could measure, kernel 6.16+ has per PID numa_task_migrated in
/proc/{PID}/sched , but I assume we would have to throw backends >>
VCPUs at it, to simulate reality and do some "waves" between different
activity periods of certain pools (I can imagine worst case scenario:
a) pgbench "a" open $VCPU connections, all idle, with scripto to sleep
for a while
b) pgbench "b" open some $VCPU new connections to some other DB, all
active from start (tpcbb or readonly)
c) manually ping CPUs using taskset for each PID all from "b" to
specific NUMA node #2 -- just to simulate unfortunate app working on
every 2nd conn
d) pgbench "a" starts working and hits CPU imbalance -- e.g. NUMA node
#1 is idle, #2 is full, CPU scheduler starts putting "a" backends on
CPUs from #1 , and we should notice PIDs being migrated)
I think the most important bit is to not put everything onto one numa node,
otherwise the chance of increased latency for *everyone* due to the increased
memory contention is more likely to hurt.
-J.
p.s. I hope i did write in an understandable way, because I had many
interruptions, so if anything is unclear please let me know.
[1]: https://lkml.org/lkml/2024/7/3/352
[2]: https://lkml.rescloud.iu.edu/2402.2/03227.html
Hi,
On 2025-07-02 14:36:31 +0200, Tomas Vondra wrote:
On 7/2/25 13:37, Ashutosh Bapat wrote:
On Wed, Jul 2, 2025 at 12:37 AM Tomas Vondra <tomas@vondra.me> wrote:
3) v1-0003-freelist-Don-t-track-tail-of-a-freelist.patch
Minor optimization. Andres noticed we're tracking the tail of buffer
freelist, without using it. So the patch removes that.

The patches for resizing buffers use the lastFreeBuffer to add new
buffers to the end of free list when expanding it. But we could as
well add it at the beginning of the free list.
Yea, I don't see any point in adding buffers to the tail instead of to the
front. We probably want more recently used buffers at the front, since they
(and the associated BufferDesc) are more likely to be in a CPU cache.
This patch seems almost independent of the rest of the patches. Do you
need it in the rest of the patches? I understand that those patches
don't need to worry about maintaining lastFreeBuffer after this patch.
Is there any other effect?

If we are going to do this, let's do it earlier so that buffer
resizing patches can be adjusted.

My patches don't particularly rely on this bit, it would work even with
lastFreeBuffer. I believe Andres simply noticed the current code does
not use lastFreeBuffer, it just maintains it, so he removed that as an
optimization.
Optimization / simplification. When building multiple freelists it was
harder to maintain the tail pointer, and since it was never used...
+1 to just applying that part.
I don't know how significant is the improvement, but if it's measurable we
could just do that independently of our patches.
I doubt it's really an improvement in any realistic scenario, but it's also
not a regression in any way, since it's never used...
FWIW, I've started to wonder if we shouldn't just get rid of the freelist
entirely. While clocksweep is perhaps minutely slower in a single thread than
the freelist, clock sweep scales *considerably* better [1]. As it's rather
rare to be bottlenecked on clock sweep speed for a single thread (rather than
IO or memory copy overhead), I think it's worth favoring clock sweep.
Also needing to switch between getting buffers from the freelist and the sweep
makes the code more expensive. I think just having the buffer in the sweep,
with a refcount / usagecount of zero would suffice.
That seems particularly advantageous if we invest energy in making the clock
sweep deal well with NUMA systems, because we don't need to have both a NUMA
aware freelist and a NUMA aware clock sweep.
Greetings,
Andres Freund
[1]: A single pg_prewarm of a large relation shows a difference between
using the freelist and not that's around the noise level, whereas 40
parallel pg_prewarms of separate relations is over 5x faster when
disabling the freelist.
For the test:
- I modified pg_buffercache_evict_* to put buffers onto the freelist
- Ensured all of shared buffers is allocated by querying
pg_shmem_allocations_numa, as otherwise the workload is dominated by the
kernel zeroing out buffers
- used shared_buffers bigger than the data
- data for single threaded is 9.7GB, data for the parallel case is 40
relations of 610MB each.
- in the single threaded case I pinned postgres to a single core, to make sure
core-to-core variation doesn't play a role
- single threaded case
c=1 && psql -Xq -c "select pg_buffercache_evict_all()" -c 'SELECT numa_node, sum(size), count(*) FROM pg_shmem_allocations_numa WHERE size != 0 GROUP BY numa_node;' && pgbench -n -P1 -c$c -j$c -f <(echo "SELECT pg_prewarm('copytest_large');") -t1
concurrent case:
c=40 && psql -Xq -c "select pg_buffercache_evict_all()" -c 'SELECT numa_node, sum(size), count(*) FROM pg_shmem_allocations_numa WHERE size != 0 GROUP BY numa_node;' && pgbench -n -P1 -c$c -j$c -f <(echo "SELECT pg_prewarm('copytest_:client_id');") -t1
On Jul 9 2025, at 12:35 pm, Andres Freund <andres@anarazel.de> wrote:
FWIW, I've started to wonder if we shouldn't just get rid of the freelist
entirely. While clocksweep is perhaps minutely slower in a single
thread than
the freelist, clock sweep scales *considerably* better [1]. As it's rather
rare to be bottlenecked on clock sweep speed for a single thread
(rather then
IO or memory copy overhead), I think it's worth favoring clock sweep.
Hey Andres, thanks for spending time on this. I've worked before on
freelist implementations (last one in LMDB) and I think you're onto
something. I think it's an innovative idea and that the speed
difference will either be lost in the noise or potentially entirely
mitigated by avoiding duplicate work.
Also needing to switch between getting buffers from the freelist and
the sweep
makes the code more expensive. I think just having the buffer in the sweep,
with a refcount / usagecount of zero would suffice.
If you're not already coding this, I'll jump in. :)
That seems particularly advantageous if we invest energy in making the clock
sweep deal well with NUMA systems, because we don't need have both a NUMA
aware freelist and a NUMA aware clock sweep.
100% agree here, very clever approach adapting clock sweep to a NUMA world.
best.
-greg
Greetings,
Andres Freund
Hi,
On 2025-07-09 12:04:00 +0200, Jakub Wartak wrote:
On Tue, Jul 8, 2025 at 2:56 PM Andres Freund <andres@anarazel.de> wrote:
On 2025-07-08 14:27:12 +0200, Tomas Vondra wrote:
On 7/8/25 05:04, Andres Freund wrote:
On 2025-07-04 13:05:05 +0200, Jakub Wartak wrote:
The reason it would be advantageous to put something like the procarray onto
smaller pages is that otherwise the entire procarray (unless particularly
large) ends up on a single NUMA node, increasing the latency for backends on
every other numa node and increasing memory traffic on that node.

Sure thing, I fully understand the motivation and underlying reason
(without claiming that I understand the exact memory access patterns
that involve procarray/PGPROC/etc and hotspots involved from PG side).
Any single-liner pgbench help for how to really easily stress the
PGPROC or procarray?
Unfortunately it's probably going to be slightly more complicated workloads
that show the effect - the very simplest cases don't go iterate through the
procarray itself anymore.
That's why the patch series splits the procarray into multiple pieces,
so that it can be properly distributed on multiple NUMA nodes even with
huge pages. It requires adjusting a couple places accessing the entries,
but it surprised me how limited the impact was.

Yes, and we are discussing if it is worth getting into smaller pages
for such usecases (e.g. 4kB ones without hugetlb with 2MB hugepages or
what more even more waste 1GB hugetlb if we dont request 2MB for some
small structs: btw, we have ability to select MAP_HUGE_2MB vs
MAP_HUGE_1GB). I'm thinking about two problems:
- 4kB are swappable and mlock() potentially (?) disarms NUMA autobalancing
I'm not really bought into this being a problem. If your system has enough
pressure to swap out the PGPROC array, you're so hosed that this won't make a
difference.
- using libnuma often leads to MPOL_BIND which disarms NUMA
autobalancing, BUT apparently there are set_mempolicy(2)/mbind(2) and
since 5.12+ kernel they can take additional flag
MPOL_F_NUMA_BALANCING(!), so this looks like it has potential to move
memory anyway (if way too many tasks are relocated, so would be
memory?). It is available only in recent libnuma as
numa_set_membind_balancing(3), but sadly there's no way via libnuma to
do mbind(MPOL_F_NUMA_BALANCING) for a specific addr only? I mean it
would have be something like MPOL_F_NUMA_BALANCING | MPOL_PREFERRED?
(select one node from many for each node while still allowing
balancing?), but in [1][2] (2024) it is stated that "It's not
legitimate (yet) to use MPOL_PREFERRED + MPOL_F_NUMA_BALANCING.", but
maybe stuff has been improved since then.

Something like:
PGPROC/procarray 2MB page for node#1 - mbind(addr1,
MPOL_F_NUMA_BALANCING | MPOL_PREFERRED, [0,1]);
PGPROC/procarray 2MB page for node#2 - mbind(addr2,
MPOL_F_NUMA_BALANCING | MPOL_PREFERRED, [1,0]);
I'm rather doubtful that it's a good idea to combine numa awareness with numa
balancing. Numa balancing adds latency and makes it much more expensive for
userspace to act in a numa aware way, since it needs to regularly update its
knowledge about where memory resides.
Sure, you can do that, but it does mean that iterations over the procarray now
have an added level of indirection...

So the most efficient would be the old-way (no indirections) vs
NUMA-way? Can this be done without #ifdefs at all?
If we used 4k pages for the procarray we would just have ~4 procs on one page,
if that range were marked as interleaved, it'd probably suffice.
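As a rough illustration of that alternative (a sketch only, assuming
libnuma is available; pg_interleave_procarray() and the 4kB-page mapping
are hypothetical):

    #include <numa.h>        /* numa_available(), numa_interleave_memory() */

    /*
     * Sketch: ask the kernel to interleave the (4kB-page backed) PGPROC
     * range across all allowed NUMA nodes, instead of binding it anywhere.
     */
    static void
    pg_interleave_procarray(void *start, size_t len)
    {
        if (numa_available() == -1)
            return;             /* no NUMA support, nothing to do */

        /* spread the pages of [start, start + len) over all nodes */
        numa_interleave_memory(start, len, numa_all_nodes_ptr);
    }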
The thing I'm not sure about is how much this actually helps with the
traffic between node. Sure, if we pick a PGPROC from the same node, and
the task does not get moved, it'll be local traffic. But if the task
moves, there'll be traffic.

With MPOL_F_NUMA_BALANCING, that should "auto-tune" in the worst case?
I doubt that NUMA balancing is going to help a whole lot here, there are too
many procs on one page for that to be helpful. One thing that might be worth
doing is to *increase* the size of PGPROC by moving other pieces of data that
are keyed by ProcNumber into PGPROC.
I think the main thing to avoid is the case where all of PGPROC, buffer
mapping table, ... resides on one NUMA node (e.g. because it's the one
postmaster was scheduled on), as the increased memory traffic will lead to
queries on that node being slower than the other node.
Greetings,
Andres Freund
Hi,
On 2025-07-09 12:55:51 -0400, Greg Burd wrote:
On Jul 9 2025, at 12:35 pm, Andres Freund <andres@anarazel.de> wrote:
FWIW, I've started to wonder if we shouldn't just get rid of the freelist
entirely. While clocksweep is perhaps minutely slower in a single
thread than
the freelist, clock sweep scales *considerably* better [1]. As it's rather
rare to be bottlenecked on clock sweep speed for a single thread
(rather then
IO or memory copy overhead), I think it's worth favoring clock sweep.

Hey Andres, thanks for spending time on this. I've worked before on
freelist implementations (last one in LMDB) and I think you're onto
something. I think it's an innovative idea and that the speed
difference will either be lost in the noise or potentially entirely
mitigated by avoiding duplicate work.
Agreed. FWIW, just using clock sweep actually makes things like DROP TABLE
perform better because it doesn't need to maintain the freelist anymore...
Also needing to switch between getting buffers from the freelist and
the sweep
makes the code more expensive. I think just having the buffer in the sweep,
with a refcount / usagecount of zero would suffice.

If you're not already coding this, I'll jump in. :)
My experimental patch is literally a four character addition ;), namely adding
"0 &&" to the relevant code in StrategyGetBuffer().
Obviously a real patch would need to do some more work than that. Feel free
to take on that project, I am not planning on tackling that in near term.
There's other things around this that could use some attention. It's not hard
to see clock sweep be a bottleneck in concurrent workloads - partially due to
the shared maintenance of the clock hand. A NUMAed clock sweep would address
that. However, we also maintain StrategyControl->numBufferAllocs, which is a
significant contention point and would not necessarily be removed by a
NUMAification of the clock sweep.
Greetings,
Andres Freund
Hi,
On 2025-07-08 16:06:00 +0000, Cédric Villemain wrote:
Assuming we want to actually pin tasks from within Postgres, what I
think might work is allowing modules to "advise" on where to place the
task. But the decision would still be done by core.

Possibly exactly what you're doing in proc.c when managing allocation of
process, but not hardcoded in postgresql (patches 02, 05 and 06 are good
candidates), I didn't get that they require information not available to any
process executing code from a module.
Parts of your code where you assign/define policy could be in one or more
relevant routines of a "numa profile manager", like in an
initProcessRoutine(), and registered in pmroutine struct:

    pmroutine = GetPmRoutineForInitProcess();
    if (pmroutine != NULL &&
        pmroutine->init_process != NULL)
        pmroutine->init_process(MyProc);

This way it's easier to manage alternative policies, and also to be able to
adjust when hardware and linux kernel changes.
I am doubtful this makes sense - as you can see patch 05 needs to change a
fair bit of core code to make this work, there's no way we can delegate much
of that to an extension.
But even if it's doable, I think it's *very* premature to focus on such
extensibility at this point - we need to get the basics into a mergeable
state, if you then want to argue for adding extensibility, we can do that at
that stage. Trying to design this for extensibility from the get-go, where
that extensibility is very unlikely to be used widely, seems rather likely to
just tank this entire project without getting us anything in return.
Greetings,
Andres Freund
Hi,
Thanks for working on this! I think it's an area we have long neglected...
On 2025-07-01 21:07:00 +0200, Tomas Vondra wrote:
Each patch has a numa_ GUC, intended to enable/disable that part. This
is meant to make development easier, not as a final interface. I'm not
sure how exactly that should look. It's possible some combinations of
GUCs won't work, etc.
Wonder if some of it might be worth putting into a multi-valued GUC (like
debug_io_direct).
1) v1-0001-NUMA-interleaving-buffers.patch
This is the main thing when people think about NUMA - making sure the
shared buffers are allocated evenly on all the nodes, not just on a
single node (which can happen easily with warmup). The regular memory
interleaving would address this, but it also has some disadvantages.

Firstly, it's oblivious to the contents of the shared memory segment,
and we may not want to interleave everything. It's also oblivious to
alignment of the items (a buffer can easily end up "split" on multiple
NUMA nodes), or relationship between different parts (e.g. there's a
BufferBlock and a related BufferDescriptor, and those might again end up
on different nodes).
Two more disadvantages:
With OS interleaving postgres doesn't (not easily at least) know about what
maps to what, which means postgres can't do stuff like numa aware buffer
replacement.
With OS interleaving the interleaving is "too fine grained", with pages being
mapped at each page boundary, making it less likely for things like one
strategy ringbuffer to reside on a single numa node.
I wonder if we should *increase* the size of shared_buffers whenever huge
pages are in use and there's padding space due to the huge page
boundaries. Pretty pointless to waste that memory if we can instead use it for
the buffer pool. Not that big a deal with 2MB huge pages, but with 1GB huge
pages...
4) v1-0004-NUMA-partition-buffer-freelist.patch
Right now we have a single freelist, and in busy instances that can be
quite contended. What's worse, the freelist may thrash between different
CPUs, NUMA nodes, etc. So the idea is to have multiple freelists on
subsets of buffers. The patch implements multiple strategies for how the
list can be split (configured using "numa_partition_freelist" GUC), for
experimenting:

* node - One list per NUMA node. This is the most natural option,
because we now know which buffer is on which node, so we can ensure a
list for a node only has buffers from that node.

* cpu - One list per CPU. Pretty simple, each CPU gets its own list.

* pid - Similar to "cpu", but the processes are mapped to lists based on
PID, not CPU ID.

* none - nothing, single freelist
Ultimately, I think we'll want to go with "node", simply because it
aligns with the buffer interleaving. But there are improvements needed.
I think we might eventually want something more granular than just "node" -
the freelist (and the clock sweep) can become a contention point even within
one NUMA node. I'm imagining something like an array of freelists/clocksweep
states, where the current numa node selects a subset of the array and the cpu
is used to choose the entry within that list.
But we can do that later, that should be a fairly simple extension of what
you're doing.
The other missing part is clocksweep - there's still just a single
instance of clocksweep, feeding buffers to all the freelists. But that's
clearly a problem, because the clocksweep returns buffers from all NUMA
nodes. The clocksweep really needs to be partitioned the same way as a
freelists, and each partition will operate on a subset of buffers (from
the right NUMA node).

I do have a separate experimental patch doing something like that, I
need to make it part of this branch.
I'm really curious about that patch, as I wrote elsewhere in this thread, I
think we should just get rid of the freelist alltogether. Even if we don't do
so, in a steady state system the clock sweep is commonly much more important
than the freelist...
5) v1-0005-NUMA-interleave-PGPROC-entries.patch
Another area that seems like it might benefit from NUMA is PGPROC, so I
gave it a try. It turned out somewhat challenging. Similarly to buffers
we have two pieces that need to be located in a coordinated way - PGPROC
entries and fast-path arrays. But we can't use the same approach as for
buffers/descriptors, because

(a) Neither of those pieces aligns with memory page size (PGPROC is
~900B, fast-path arrays are variable length).
(b) We could pad PGPROC entries e.g. to 1KB, but that'd still require
rather high max_connections before we use multiple huge pages.
We should probably pad them regardless? Right now sizeof(PGPROC) happens to be
a multiple of 64 (i.e. the most common cache line size), but that hasn't always
been the case, and isn't the case on systems with 128 byte cache lines like
common ARMv8 systems. And having one cacheline hold one backend's fast-path
state and another backend's xmin doesn't sound like a recipe for good
performance.
Seems like we should also do some reordering of the contents within PGPROC. We
e.g. have very frequently changing data (->waitStatus, ->lwWaiting) in
the same cacheline as almost immutable data (->pid, ->pgxactoff,
->databaseId).
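If padding were added, one way to do it would mirror the existing
BufferDescPadded pattern; a sketch only, where PGPROCPadded and
PGPROC_PADDED_SIZE are made-up names and the 1KB target is an assumption:

    /* Hypothetical padding wrapper, modeled on BufferDescPadded. */
    #define PGPROC_PADDED_SIZE 1024     /* assumed target, a multiple of the
                                         * largest expected cache line size */

    typedef union PGPROCPadded
    {
        PGPROC      proc;
        char        pad[PGPROC_PADDED_SIZE];
    } PGPROCPadded;

    StaticAssertDecl(sizeof(PGPROC) <= PGPROC_PADDED_SIZE,
                     "PGPROC exceeds padded size");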
So what I did instead is splitting the whole PGPROC array into one array
per NUMA node, and one array for auxiliary processes and 2PC xacts. So
with 4 NUMA nodes there are 5 separate arrays, for example. Each array
is a multiple of memory pages, so we may waste some of the memory. But
that's simply how NUMA works - page granularity.
Theoretically we could use the "padding" memory at the end of each NUMA node's
PGPROC array for the 2PC entries, since for those we presumably don't care
about locality. Not sure it's worth the complexity though.
For a while I thought I had a better solution: Given that we're going to waste
all the "padding" memory, why not just oversize the PGPROC array so that it
spans the required number of NUMA nodes?
The problem is that that would lead to ProcNumbers to get much larger, and we
do have other arrays that are keyed by ProcNumber. Which probably makes this
not so great an idea.
This however makes one particular thing harder - in a couple places we
accessed PGPROC entries through PROC_HDR->allProcs, which was pretty
much just one large array. And GetNumberFromPGProc() relied on array
arithmetics to determine procnumber. With the array partitioned, this
can't work the same way.

But there's a simple solution - if we turn allProcs into an array of
*pointers* to PGPROC arrays, there's no issue. All the places need a
pointer anyway. And then we need an explicit procnumber field in PGPROC,
instead of calculating it.

There's a chance this has a negative impact on code that accessed PGPROC
very often, but so far I haven't seen such cases. But if you can come up
with such examples, I'd like to see those.
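To illustrate the indirection being discussed - a sketch under the
assumption that allProcs becomes an array of PGPROC pointers; the macro
names follow the existing ones, but the bodies shown here are hypothetical:

    /* Before (roughly): allProcs is one contiguous PGPROC array. */
    /* #define GetPGProcByNumber(n)   (&ProcGlobal->allProcs[(n)])       */
    /* #define GetNumberFromPGProc(p) ((p) - &ProcGlobal->allProcs[0])   */

    /* After (sketch): allProcs holds pointers, procnumber is stored. */
    #define GetPGProcByNumber(n)   (ProcGlobal->allProcs[(n)])
    #define GetNumberFromPGProc(p) ((p)->procnumber)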
I'd not be surprised if there were overhead, adding a level of indirection to
things like ProcArrayGroupClearXid(), GetVirtualXIDsDelayingChkpt(),
SignalVirtualTransaction() probably won't be free.
BUT: For at least some of these a better answer might be to add additional
"dense" arrays like we have for xids etc, so they don't need to trawl through
PGPROCs.
There's another detail - when obtaining a PGPROC entry in InitProcess(),
we try to get an entry from the same NUMA node. And only if that doesn't
work, we grab the first one from the list (there's still just one PGPROC
freelist, I haven't split that - maybe we should?).
I guess it might be worth partitioning the freelist, iterating through a few
thousand links just to discover that there's no free proc on the current numa
node, while holding a spinlock, doesn't sound great. Even if it's likely
rarely a huge issue compared to other costs.
The other thing I haven't thought about very much is determining on
which CPUs/nodes the instance is allowed to run. I assume we'd start by
simply inherit/determine that at the start through libnuma, not through
some custom PG configuration (which the patch [2] proposed to do).
That seems like the right thing to me.
One thing that this patchset afaict doesn't address so far is that there is a
fair bit of other important shared memory that this patch doesn't set up
intelligently e.g. the buffer mapping table itself (but there are loads of
other cases). Because we touch a lot of that memory during startup, most of it
will be allocated on whatever NUMA node postmaster was scheduled on. I suspect
that the best we can do for parts of shared memory where we don't have
explicit NUMA awareness is to default to an interleave policy.
From 9712e50d6d15c18ea2c5fcf457972486b0d4ef53 Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tomas@vondra.me>
Date: Tue, 6 May 2025 21:12:21 +0200
Subject: [PATCH v1 1/6] NUMA: interleaving buffers

Ensure shared buffers are allocated from all NUMA nodes, in a balanced
way, instead of just using the node where Postgres initially starts, or
where the kernel decides to migrate the page, etc. With pre-warming
performed by a single backend, this can easily result in severely
unbalanced memory distribution (with most from a single NUMA node).

The kernel would eventually move some of the memory to other nodes
(thanks to zone_reclaim), but that tends to take a long time. So this
patch improves predictability, reduces the time needed for warmup
during benchmarking, etc. It's less dependent on what the CPU
scheduler does, etc.
FWIW, I don't think zone_reclaim_mode will commonly do that? Even if enabled,
which I don't think it is anymore by default. At least huge pages can't be
reclaimed by the kernel, but even when not using huge pages, I think the only
scenario where that would happen is if shared_buffers were swapped out.
Numa balancing might eventually "fix" such an imbalance though.
diff --git a/src/backend/storage/buffer/buf_init.c b/src/backend/storage/buffer/buf_init.c
index ed1dc488a42..2ad34624c49 100644
--- a/src/backend/storage/buffer/buf_init.c
+++ b/src/backend/storage/buffer/buf_init.c
@@ -14,9 +14,17 @@
  */
 #include "postgres.h"

+#ifdef USE_LIBNUMA
+#include <numa.h>
+#include <numaif.h>
+#endif
+
I wonder how much of this we should try to put into port/pg_numa.c. Having
direct calls to libnuma code all over the backend will make it rather hard to
add numa awareness for hypothetical platforms not using libnuma compatible
interfaces.
+/* number of buffers allocated on the same NUMA node */
+static int64 numa_chunk_buffers = -1;
Given that NBuffers is a 32bit quantity, this probably doesn't need to be
64bit... Anyway, I'm not going to review on that level going forward, the
patch is probably in too early a state for that.
@@ -71,18 +92,80 @@ BufferManagerShmemInit(void)
 				foundDescs,
 				foundIOCV,
 				foundBufCkpt;
+	Size		mem_page_size;
+	Size		buffer_align;
+
+	/*
+	 * XXX A bit weird. Do we need to worry about postmaster? Could this even
+	 * run outside postmaster? I don't think so.
It can run in single user mode - but that shouldn't prevent us from using
pg_get_shmem_pagesize().
+	 * XXX Another issue is we may get different values than when sizing the
+	 * the memory, because at that point we didn't know if we get huge pages,
+	 * so we assumed we will. Shouldn't cause crashes, but we might allocate
+	 * shared memory and then not use some of it (because of the alignment
+	 * that we don't actually need). Not sure about better way, good for now.
+	 */
Ugh, not seeing a great way to deal with that either.
+	 * XXX Maybe with (mem_page_size > PG_IO_ALIGN_SIZE), we don't need to
+	 * align to mem_page_size? Especially for very large huge pages (e.g. 1GB)
+	 * that doesn't seem quite worth it. Maybe we should simply align to
+	 * BLCKSZ, so that buffers don't get split? Still, we might interfere with
+	 * other stuff stored in shared memory that we want to allocate on a
+	 * particular NUMA node (e.g. ProcArray).
+	 *
+	 * XXX Maybe with "too large" huge pages we should just not do this, or
+	 * maybe do this only for sufficiently large areas (e.g. shared buffers,
+	 * but not ProcArray).
I think that's right - there's no point in using 1GB pages for anything other
than shared_buffers, we should allocate shared_buffers separately.
+/*
+ * Determine the size of memory page.
+ *
+ * XXX This is a bit tricky, because the result depends at which point we call
+ * this. Before the allocation we don't know if we succeed in allocating huge
+ * pages - but we have to size everything for the chance that we will. And then
+ * if the huge pages fail (with 'huge_pages=try'), we'll use the regular memory
+ * pages. But at that point we can't adjust the sizing.
+ *
+ * XXX Maybe with huge_pages=try we should do the sizing twice - first with
+ * huge pages, and if that fails, then without them. But not for this patch.
+ * Up to this point there was no such dependency on huge pages.
+ */
Doing it twice sounds somewhat nasty - but perhaps we could just have the
shmem size infrastructure compute two different numbers, one for use with huge
pages and one without?
+static int64
+choose_chunk_buffers(int NBuffers, Size mem_page_size, int num_nodes)
+{
+	int64		num_items;
+	int64		max_items;
+
+	/* make sure the chunks will align nicely */
+	Assert(BLCKSZ % sizeof(BufferDescPadded) == 0);
+	Assert(mem_page_size % sizeof(BufferDescPadded) == 0);
+	Assert(((BLCKSZ % mem_page_size) == 0) || ((mem_page_size % BLCKSZ) == 0));
+
+	/*
+	 * The minimum number of items to fill a memory page with descriptors and
+	 * blocks. The NUMA allocates memory in pages, and we need to do that for
+	 * both buffers and descriptors.
+	 *
+	 * In practice the BLCKSZ doesn't really matter, because it's much larger
+	 * than BufferDescPadded, so the result is determined buffer descriptors.
+	 * But it's clearer this way.
+	 */
+	num_items = Max(mem_page_size / sizeof(BufferDescPadded),
+					mem_page_size / BLCKSZ);
+
+	/*
+	 * We shouldn't use chunks larger than NBuffers/num_nodes, because with
+	 * larger chunks the last NUMA node would end up with much less memory (or
+	 * no memory at all).
+	 */
+	max_items = (NBuffers / num_nodes);
+
+	/*
+	 * Did we already exceed the maximum desirable chunk size? That is, will
+	 * the last node get less than one whole chunk (or no memory at all)?
+	 */
+	if (num_items > max_items)
+		elog(WARNING, "choose_chunk_buffers: chunk items exceeds max (%ld > %ld)",
+			 num_items, max_items);
+
+	/* grow the chunk size until we hit the max limit. */
+	while (2 * num_items <= max_items)
+		num_items *= 2;
Something around this logic leads to a fair bit of imbalance - I started postgres with
huge_page_size=1GB, shared_buffers=4GB on a 2 node system and that results in
postgres[4188255][1]=# SELECT * FROM pg_shmem_allocations_numa WHERE name in ('Buffer Blocks', 'Buffer Descriptors');
┌────────────────────┬───────────┬────────────┐
│ name │ numa_node │ size │
├────────────────────┼───────────┼────────────┤
│ Buffer Blocks │ 0 │ 5368709120 │
│ Buffer Blocks │ 1 │ 1073741824 │
│ Buffer Descriptors │ 0 │ 1073741824 │
│ Buffer Descriptors │ 1 │ 1073741824 │
└────────────────────┴───────────┴────────────┘
(4 rows)
With shared_buffers=8GB postgres failed to start, even though 16 1GB huge
pages are available, as 18GB were requested.
After increasing the limit, the top allocations were as follows:
postgres[4189384][1]=# SELECT * FROM pg_shmem_allocations ORDER BY allocated_size DESC LIMIT 5;
┌──────────────────────┬─────────────┬────────────┬────────────────┐
│ name │ off │ size │ allocated_size │
├──────────────────────┼─────────────┼────────────┼────────────────┤
│ Buffer Blocks │ 1192223104 │ 9663676416 │ 9663676416 │
│ PGPROC structures │ 10970279808 │ 3221733342 │ 3221733376 │
│ Fast-Path Lock Array │ 14192013184 │ 3221396544 │ 3221396608 │
│ Buffer Descriptors │ 51372416 │ 1140850688 │ 1140850688 │
│ (null) │ 17468590976 │ 785020032 │ 785020032 │
└──────────────────────┴─────────────┴────────────┴────────────────┘
With a fair bit of imbalance:
postgres[4189384][1]=# SELECT * FROM pg_shmem_allocations_numa WHERE name in ('Buffer Blocks', 'Buffer Descriptors');
┌────────────────────┬───────────┬────────────┐
│ name │ numa_node │ size │
├────────────────────┼───────────┼────────────┤
│ Buffer Blocks │ 0 │ 8589934592 │
│ Buffer Blocks │ 1 │ 2147483648 │
│ Buffer Descriptors │ 0 │ 0 │
│ Buffer Descriptors │ 1 │ 2147483648 │
└────────────────────┴───────────┴────────────┘
(4 rows)
Note that the buffer descriptors are all on node 1.
+/*
+ * Calculate the NUMA node for a given buffer.
+ */
+int
+BufferGetNode(Buffer buffer)
+{
+	/* not NUMA interleaving */
+	if (numa_chunk_buffers == -1)
+		return -1;
+
+	return (buffer / numa_chunk_buffers) % numa_nodes;
+}
FWIW, this is likely rather expensive - when not a compile time constant,
divisions and modulo can take a fair number of cycles.
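For example, if both the chunk size and the node count were forced to
powers of two (which is not guaranteed for the node count - 3-node systems
exist), the division and modulo could become a shift and a mask. A sketch,
with numa_chunk_shift and numa_node_mask as hypothetical precomputed
values:

    /* precomputed at startup, assuming power-of-two chunk size/node count */
    static int  numa_chunk_shift;   /* log2(numa_chunk_buffers) */
    static int  numa_node_mask;     /* numa_nodes - 1 */

    int
    BufferGetNodeFast(Buffer buffer)
    {
        if (numa_chunk_buffers == -1)
            return -1;

        return (buffer >> numa_chunk_shift) & numa_node_mask;
    }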
+/*
+ * pg_numa_interleave_memory
+ *		move memory to different NUMA nodes in larger chunks
+ *
+ * startptr - start of the region (should be aligned to page size)
+ * endptr - end of the region (doesn't need to be aligned)
+ * mem_page_size - size of the memory page size
+ * chunk_size - size of the chunk to move to a single node (should be multiple
+ * of page size
+ * num_nodes - number of nodes to allocate memory to
+ *
+ * XXX Maybe this should use numa_tonode_memory and numa_police_memory instead?
+ * That might be more efficient than numa_move_pages, as it works on larger
+ * chunks of memory, not individual system pages, I think.
+ *
+ * XXX The "interleave" name is not quite accurate, I guess.
+ */
+static void
+pg_numa_interleave_memory(char *startptr, char *endptr,
+						  Size mem_page_size, Size chunk_size,
+						  int num_nodes)
+{
Seems like this should be in pg_numa.c?
diff --git a/src/bin/pgbench/pgbench.c b/src/bin/pgbench/pgbench.c
index 69b6a877dc9..c07de903f76 100644
--- a/src/bin/pgbench/pgbench.c
+++ b/src/bin/pgbench/pgbench.c
I assume those changes weren't intentionally part of this patch...
From 6505848ac8359c8c76dfbffc7150b6601ab07601 Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tomas@vondra.me>
Date: Thu, 22 May 2025 18:38:41 +0200
Subject: [PATCH v1 4/6] NUMA: partition buffer freelist

Instead of a single buffer freelist, partition into multiple smaller
lists, to reduce lock contention, and to spread the buffers over all
NUMA nodes more evenly.

There are four strategies, specified by GUC numa_partition_freelist
* none - single long freelist, should work just like now
* node - one freelist per NUMA node, with only buffers from that node
* cpu - one freelist per CPU
* pid - freelist determined by PID (same number of freelists as 'cpu')
When allocating a buffer, it's taken from the correct freelist (e.g.
same NUMA node).

Note: This is (probably) more important than partitioning ProcArray.
+/*
+ * Represents one freelist partition.
+ */
+typedef struct BufferStrategyFreelist
+{
+	/* Spinlock: protects the values below */
+	slock_t		freelist_lock;
+
+	/*
+	 * XXX Not sure why this needs to be aligned like this. Need to ask
+	 * Andres.
+	 */
+	int			firstFreeBuffer __attribute__((aligned(64)));	/* Head of list of
+																 * unused buffers */
+
+	/* Number of buffers consumed from this list. */
+	uint64		consumed;
+} BufferStrategyFreelist;
I think this might be a leftover from measuring performance of a *non*
partitioned freelist. I saw unnecessar contention between
BufferStrategyControl->{nextVictimBuffer,buffer_strategy_lock,numBufferAllocs}
and was testing what effect the simplest avoidance scheme has.
I don't think this should be part of this patchset.
 /*
  * The shared freelist control information.
  */

@@ -39,8 +66,6 @@ typedef struct
 	pg_atomic_uint32 nextVictimBuffer;

-	int			firstFreeBuffer;	/* Head of list of unused buffers */
-
 	/*
 	 * Statistics. These counters should be wide enough that they can't
 	 * overflow during a single bgwriter cycle.
@@ -51,13 +76,27 @@ typedef struct
 	/*
 	 * Bgworker process to be notified upon activity or -1 if none. See
 	 * StrategyNotifyBgWriter.
+	 *
+	 * XXX Not sure why this needs to be aligned like this. Need to ask
+	 * Andres. Also, shouldn't the alignment be specified after, like for
+	 * "consumed"?
 	 */
-	int			bgwprocno;
+	int			__attribute__((aligned(64))) bgwprocno;
+
+	BufferStrategyFreelist freelists[FLEXIBLE_ARRAY_MEMBER];
 } BufferStrategyControl;
Here the reason was that it's silly to put almost-readonly data (like
bgwprocno) onto the same cacheline as very frequently modified data like
->numBufferAllocs. That causes unnecessary cache misses in many
StrategyGetBuffer() calls, as another backend's StrategyGetBuffer() will
always have modified ->numBufferAllocs and either ->buffer_strategy_lock or
->nextVictimBuffer.
+static BufferStrategyFreelist *
+ChooseFreeList(void)
+{
+	unsigned	cpu;
+	unsigned	node;
+	int			rc;
+
+	int			freelist_idx;
+
+	/* freelist not partitioned, return the first (and only) freelist */
+	if (numa_partition_freelist == FREELIST_PARTITION_NONE)
+		return &StrategyControl->freelists[0];
+
+	/*
+	 * freelist is partitioned, so determine the CPU/NUMA node, and pick a
+	 * list based on that.
+	 */
+	rc = getcpu(&cpu, &node);
+	if (rc != 0)
+		elog(ERROR, "getcpu failed: %m");
Probably should put this into somewhere abstracted away...
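Something along these lines, perhaps, as a pg_numa.c helper (a sketch;
pg_numa_get_current() is a made-up name, and a real version would need a
fallback for platforms without getcpu()):

    #define _GNU_SOURCE
    #include <sched.h>          /* getcpu(), glibc >= 2.29 */

    /*
     * Hypothetical wrapper: report the CPU and NUMA node the calling
     * process is currently running on, or -1/-1 if that can't be determined.
     */
    static void
    pg_numa_get_current(int *cpu, int *node)
    {
        unsigned int c,
                    n;

        if (getcpu(&c, &n) == 0)
        {
            *cpu = (int) c;
            *node = (int) n;
        }
        else
        {
            *cpu = -1;
            *node = -1;
        }
    }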
+	/*
+	 * Pick the freelist, based on CPU, NUMA node or process PID. This matches
+	 * how we built the freelists above.
+	 *
+	 * XXX Can we rely on some of the values (especially strategy_nnodes) to
+	 * be a power-of-2? Then we could replace the modulo with a mask, which is
+	 * likely more efficient.
+	 */
+	switch (numa_partition_freelist)
+	{
+		case FREELIST_PARTITION_CPU:
+			freelist_idx = cpu % strategy_ncpus;
As mentioned earlier, modulo is rather expensive for something executed so
frequently...
+			break;
+
+		case FREELIST_PARTITION_NODE:
+			freelist_idx = node % strategy_nnodes;
+			break;
Here we shouldn't need modulo, right?
+
+		case FREELIST_PARTITION_PID:
+			freelist_idx = MyProcPid % strategy_ncpus;
+			break;
+
+		default:
+			elog(ERROR, "unknown freelist partitioning value");
+	}
+
+	return &StrategyControl->freelists[freelist_idx];
+}
/* size of lookup hash table ... see comment in StrategyInitialize */
 	size = add_size(size, BufTableShmemSize(NBuffers + NUM_BUFFER_PARTITIONS));

 	/* size of the shared replacement strategy control block */
-	size = add_size(size, MAXALIGN(sizeof(BufferStrategyControl)));
+	size = add_size(size, MAXALIGN(offsetof(BufferStrategyControl, freelists)));
+
+	/*
+	 * Allocate one frelist per CPU. We might use per-node freelists, but the
+	 * assumption is the number of CPUs is less than number of NUMA nodes.
+	 *
+	 * FIXME This assumes the we have more CPUs than NUMA nodes, which seems
+	 * like a safe assumption. But maybe we should calculate how many elements
+	 * we actually need, depending on the GUC? Not a huge amount of memory.
FWIW, I don't think that's a safe assumption anymore. With CXL we can get a)
PCIe attached memory and b) remote memory as a separate NUMA nodes, and that
very well could end up as more NUMA nodes than cores.
Ugh, -ETOOLONG. Gotta schedule some other things...
Greetings,
Andres Freund
On Wed, Jul 9, 2025 at 7:13 PM Andres Freund <andres@anarazel.de> wrote:
Yes, and we are discussing if it is worth getting into smaller pages
for such usecases (e.g. 4kB ones without hugetlb with 2MB hugepages or
what more even more waste 1GB hugetlb if we dont request 2MB for some
small structs: btw, we have ability to select MAP_HUGE_2MB vs
MAP_HUGE_1GB). I'm thinking about two problems:
- 4kB are swappable and mlock() potentially (?) disarms NUMA autobalancing

I'm not really bought into this being a problem. If your system has enough
pressure to swap out the PGPROC array, you're so hosed that this won't make a
difference.
OK, I need to bend here, yet part of me still believes that the
situation where we have hugepages (for 'Buffer Blocks') and yet some
smaller, but far more critical structs are more likely to be swapped
out due to the pressure of some backend-gone-wild random mallocs() is
unhealthy (especially since the OS might prefer swapping based on the
per-node rather than the global picture).
I'm rather doubtful that it's a good idea to combine numa awareness with numa
balancing. Numa balancing adds latency and makes it much more expensive for
userspace to act in a numa aware way, since it needs to regularly update its
knowledge about where memory resides.
Well, the problem is that backends come and go to random CPUs
often (migrated++ on very high backend counts and non-uniform
workloads in terms of backend-CPU usage), but the autobalancing
doesn't need to be on or off for everything. It could be autobalancing
for a certain memory region only, without affecting the app in any way
(well, other than the minor page faults, literally).
If we used 4k pages for the procarray we would just have ~4 procs on one page,
if that range were marked as interleaved, it'd probably suffice.
OK, this sounds like the best and simplest proposal to me, yet the
patch doesn't do OS-based interleaving for those today. Gonna try that
mlock() sooner or later... ;)
-J.
On Wed, Jul 9, 2025 at 9:42 PM Andres Freund <andres@anarazel.de> wrote:
On 2025-07-01 21:07:00 +0200, Tomas Vondra wrote:
Each patch has a numa_ GUC, intended to enable/disable that part. This
is meant to make development easier, not as a final interface. I'm not
sure how exactly that should look. It's possible some combinations of
GUCs won't work, etc.

Wonder if some of it might be worth putting into a multi-valued GUC (like
debug_io_direct).
Long-term or for experimentation? Also please see below as it is related:
[..]
FWIW, I don't think that's a safe assumption anymore. With CXL we can get a)
PCIe attached memory and b) remote memory as a separate NUMA nodes, and that
very well could end up as more NUMA nodes than cores.
In my earlier, apparently way too naive approach, I've tried to
handle this CXL scenario, but I'm afraid this cannot be done without
further configuration; please see review/use cases [1] and [2].
-J.
[1]: /messages/by-id/attachment/178119/v4-0001-Add-capability-to-interleave-shared-memory-across.patch - just see sgml/GUC and we have numa_parse_nodestring(3)
[2]: /messages/by-id/aAKPMrX1Uq6quKJy@ip-10-97-1-34.eu-west-3.compute.internal
On Jul 9, 2025, at 1:23 PM, Andres Freund <andres@anarazel.de> wrote:
Hi,
On 2025-07-09 12:55:51 -0400, Greg Burd wrote:
On Jul 9 2025, at 12:35 pm, Andres Freund <andres@anarazel.de> wrote:
FWIW, I've started to wonder if we shouldn't just get rid of the freelist
entirely. While clocksweep is perhaps minutely slower in a single
thread than
the freelist, clock sweep scales *considerably* better [1]. As it's rather
rare to be bottlenecked on clock sweep speed for a single thread
(rather then
IO or memory copy overhead), I think it's worth favoring clock sweep.

Hey Andres, thanks for spending time on this. I've worked before on
freelist implementations (last one in LMDB) and I think you're onto
something. I think it's an innovative idea and that the speed
difference will either be lost in the noise or potentially entirely
mitigated by avoiding duplicate work.

Agreed. FWIW, just using clock sweep actually makes things like DROP TABLE
perform better because it doesn't need to maintain the freelist anymore...

Also needing to switch between getting buffers from the freelist and
the sweep
makes the code more expensive. I think just having the buffer in the sweep,
with a refcount / usagecount of zero would suffice.

If you're not already coding this, I'll jump in. :)
My experimental patch is literally a four character addition ;), namely adding
"0 &&" to the relevant code in StrategyGetBuffer().Obviously a real patch would need to do some more work than that. Feel free
to take on that project, I am not planning on tackling that in near term.
I started on this last night, making good progress. Thanks for the inspiration. I'll create a new thread to track the work and cross-reference when I have something reasonable to show (hopefully later today).
There's other things around this that could use some attention. It's not hard
to see clock sweep be a bottleneck in concurrent workloads - partially due to
the shared maintenance of the clock hand. A NUMAed clock sweep would address
that.
Working on it. Other than NUMA-fying clocksweep there is a function have_free_buffer() that might be a tad tricky to re-implement efficiently and/or make NUMA aware. Or maybe I can remove that too? It is used in autoprewarm.c and possibly other extensions, but nowhere else in core.
However, we also maintain StrategyControl->numBufferAllocs, which is a
significant contention point and would not necessarily be removed by a
NUMAification of the clock sweep.
Yep, I noted this counter and its potential for contention too. Fortunately, it seems like it is only used so that "bgwriter can estimate the rate of buffer consumption" which to me opens the door to a less accurate partitioned counter, perhaps something lock-free (no mutex/CAS) that is bucketed then combined when read.
A quick look at bufmgr.c indicates that recent_allocs (which is StrategyControl->numBufferAllocs) is used to track a "moving average" and other voodoo there I've yet to fully grok. Any thoughts on this approximate count approach?
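A minimal sketch of that idea, assuming one counter per freelist/clock-sweep
partition and PostgreSQL's pg_atomic API; the names used here
(AllocCounterSlot, CountBufferAlloc, SumBufferAllocs) are made up:

    /* One padded slot per partition to avoid false sharing. */
    typedef struct AllocCounterSlot
    {
        pg_atomic_uint64 numBufferAllocs;
        char        pad[PG_CACHE_LINE_SIZE - sizeof(pg_atomic_uint64)];
    } AllocCounterSlot;

    static AllocCounterSlot *alloc_counters;    /* one per partition */

    /* hot path: bump only the local partition's counter */
    static inline void
    CountBufferAlloc(int partition)
    {
        pg_atomic_fetch_add_u64(&alloc_counters[partition].numBufferAllocs, 1);
    }

    /* cold path (bgwriter): sum all partitions for an approximate total */
    static uint64
    SumBufferAllocs(int num_partitions)
    {
        uint64      total = 0;

        for (int i = 0; i < num_partitions; i++)
            total += pg_atomic_read_u64(&alloc_counters[i].numBufferAllocs);
        return total;
    }

The total is only approximate while backends keep allocating, which seems
compatible with the "rate estimate" use the bgwriter makes of it.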
Also, what are your thoughts on updating the algorithm to CLOCK-Pro [1] while I'm there? I guess I'd have to try it out, measure it a lot and see if there are any material benefits. Maybe I'll keep that for a future patch, or at least layer it... back to work!

[1]: https://www.usenix.org/legacy/publications/library/proceedings/usenix05/tech/general/full_papers/jiang/jiang_html/html.html
Greetings,
Andres Freund
best.
-greg
Hi,
On Wed, Jul 09, 2025 at 03:42:26PM -0400, Andres Freund wrote:
Hi,
Thanks for working on this!
Indeed, thanks!
On 2025-07-01 21:07:00 +0200, Tomas Vondra wrote:
1) v1-0001-NUMA-interleaving-buffers.patch
This is the main thing when people think about NUMA - making sure the
shared buffers are allocated evenly on all the nodes, not just on a
single node (which can happen easily with warmup). The regular memory
interleaving would address this, but it also has some disadvantages.

Firstly, it's oblivious to
and we may not want to interleave everything. It's also oblivious to
alignment of the items (a buffer can easily end up "split" on multiple
NUMA nodes), or relationship between different parts (e.g. there's a
BufferBlock and a related BufferDescriptor, and those might again end up
on different nodes).

Two more disadvantages:
With OS interleaving postgres doesn't (not easily at least) know about what
maps to what, which means postgres can't do stuff like numa aware buffer
replacement.

With OS interleaving the interleaving is "too fine grained", with pages being
mapped at each page boundary, making it less likely for things like one
strategy ringbuffer to reside on a single numa node.
There's a secondary benefit of explicitly assigning buffers to nodes,
using this simple scheme - it allows quickly determining the node ID
given a buffer ID. This is helpful later, when building freelist.
I do think this is a big advantage as compare to the OS interleaving.
I wonder if we should *increase* the size of shared_buffers whenever huge
pages are in use and there's padding space due to the huge page
boundaries. Pretty pointless to waste that memory if we can instead use it for
the buffer pool. Not that big a deal with 2MB huge pages, but with 1GB huge
pages...
I think that makes sense, except maybe for operations that need to scan
the whole buffer pool (i.e related to BUF_DROP_FULL_SCAN_THRESHOLD)?
5) v1-0005-NUMA-interleave-PGPROC-entries.patch
Another area that seems like it might benefit from NUMA is PGPROC, so I
gave it a try. It turned out somewhat challenging. Similarly to buffers
we have two pieces that need to be located in a coordinated way - PGPROC
entries and fast-path arrays. But we can't use the same approach as for
buffers/descriptors, because

(a) Neither of those pieces aligns with memory page size (PGPROC is
~900B, fast-path arrays are variable length).

(b) We could pad PGPROC entries e.g. to 1KB, but that'd still require
rather high max_connections before we use multiple huge pages.

Right now sizeof(PGPROC) happens to be multiple of 64 (i.e. the most common
cache line size)
Oh right, it's currently 832 bytes and the patch extends that to 840 bytes.
With a bit of reordering:
diff --git a/src/include/storage/proc.h b/src/include/storage/proc.h
index 5cb1632718e..2ed2f94202a 100644
--- a/src/include/storage/proc.h
+++ b/src/include/storage/proc.h
@@ -194,8 +194,6 @@ struct PGPROC
* vacuum must not remove tuples deleted by
* xid >= xmin ! */
- int procnumber; /* index in ProcGlobal->allProcs */
-
int pid; /* Backend's process ID; 0 if prepared xact */
int pgxactoff; /* offset into various ProcGlobal->arrays with
@@ -243,6 +241,7 @@ struct PGPROC
/* Support for condition variables. */
proclist_node cvWaitLink; /* position in CV wait list */
+ int procnumber; /* index in ProcGlobal->allProcs */
/* Info about lock the process is currently waiting for, if any. */
/* waitLock and waitProcLock are NULL if not currently waiting. */
@@ -268,6 +267,7 @@ struct PGPROC
*/
XLogRecPtr waitLSN; /* waiting for this LSN or higher */
int syncRepState; /* wait state for sync rep */
+ int numa_node;
dlist_node syncRepLinks; /* list link if process is in syncrep queue */
/*
@@ -321,9 +321,6 @@ struct PGPROC
PGPROC *lockGroupLeader; /* lock group leader, if I'm a member */
dlist_head lockGroupMembers; /* list of members, if I'm a leader */
dlist_node lockGroupLink; /* my member link, if I'm a member */
-
- /* NUMA node */
- int numa_node;
};
That could be back to 832 (the order does not make sense logically speaking
though).
Regards,
--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
On 7/9/25 08:40, Cédric Villemain wrote:
On 7/8/25 18:06, Cédric Villemain wrote:
On 7/8/25 03:55, Cédric Villemain wrote:
Hi Andres,
Hi,
On 2025-07-05 07:09:00 +0000, Cédric Villemain wrote:
In my work on more careful PostgreSQL resource management, I've come to the
conclusion that we should avoid pushing policy too deeply into the PostgreSQL
core itself. Therefore, I'm quite skeptical about integrating NUMA-specific
management directly into core PostgreSQL in such a way.

I think it's actually the opposite - whenever we pushed stuff like this
outside of core it has hurt postgres substantially. Not having replication in
core was a huge mistake. Not having HA management in core is probably the
biggest current adoption hurdle for postgres.

To deal better with NUMA we need to improve memory placement and various
algorithms, in an interrelated way - that's pretty much impossible to do
outside of core.

Except the backend pinning which is easy to achieve, thus my comment on
the related patch.
I'm not claiming NUMA memory and all should be managed outside of core
(though I didn't read other patches yet).

But an "optimal backend placement" seems to very much depend on where we
placed the various pieces of shared memory. Which the external module
will have trouble following, I suspect.

I still don't have any idea what exactly the external module would do,
or how it would decide where to place the backend. Can you describe some
use case with an example?

Assuming we want to actually pin tasks from within Postgres, what I
think might work is allowing modules to "advise" on where to place the
task. But the decision would still be done by core.

Possibly exactly what you're doing in proc.c when managing allocation of
processes, but not hardcoded in postgresql (patches 02, 05 and 06 are good
candidates). I didn't get that they require information not available to
any process executing code from a module.

Well, it needs to understand how some other stuff (especially PGPROC
entries) is distributed between nodes. I'm not sure how much of this
internal information we want to expose outside core ...

Parts of your code where you assign/define policy could be in one or
more relevant routines of a "numa profile manager", like in an
initProcessRoutine(), and registered in a pmroutine struct:

    pmroutine = GetPmRoutineForInitProcess();
    if (pmroutine != NULL &&
        pmroutine->init_process != NULL)
        pmroutine->init_process(MyProc);

This way it's easier to manage alternative policies, and also to be able
to adjust when hardware and the Linux kernel change.

I'm not against making this extensible, in some way. But I still
struggle to imagine a reasonable alternative policy, where the external
module gets the same information and ends up with a different decision.
So what would the alternate policy look like? What use case would the
module be supporting?

That's the whole point: there are very distinct usages of PostgreSQL in
the field. And maybe not all of them will require the policy defined by
PostgreSQL core.

May I ask the reverse: what prevents external modules from taking those
decisions? There are already a lot of areas where external code can take
over PostgreSQL processing, like Neon is doing.
The complexity of making everything extensible in an arbitrary way. To
make it extensible in a useful way, we need to have a reasonably clear idea
what aspects need to be extensible, and what's the goal.
There is some very early processing for memory setup that I can see as
a current blocker, and here I'd refer to a more compliant NUMA API as
proposed by Jakub, so it's possible to arrange based on workload,
hardware configuration or other matters. Reworking to get distinct
segments and all as you do is great, and a combo of both approaches is
probably of great interest. There is also the weighted interleave being
discussed, and probably much more to come in this area in Linux.

I think some points were raised already about possible distinct policies; I
am precisely claiming that it is hard to come up with one good policy with
limited setup options, thus the requirement to keep that flexible enough
(hooks, an API, 100 GUCs?).
I'm sorry, I don't want to sound too negative, but "I want arbitrary
extensibility" is not a very useful feedback. I've asked you to give
some examples of policies that'd customize some of the NUMA stuff.
There is an EPYC story here also: given the NUMA setup can vary
depending on BIOS settings, the associated NUMA policy must probably take
that into account (L3 can be either real cache or 4 extra "local" NUMA nodes
- with highly distinct access costs from a RAM module).

Does that change how PostgreSQL will place memory and processes? Is it
important or of interest?
So how exactly would the policy handle this? Right now we're entirely
oblivious to L3, or on-CPU caches in general. We don't even consider the
size of L3 when sizing hash tables in a hashjoin etc.
regards
--
Tomas Vondra
On 7/9/25 19:23, Andres Freund wrote:
Hi,
On 2025-07-09 12:55:51 -0400, Greg Burd wrote:
On Jul 9 2025, at 12:35 pm, Andres Freund <andres@anarazel.de> wrote:
FWIW, I've started to wonder if we shouldn't just get rid of the freelist
entirely. While clocksweep is perhaps minutely slower in a single thread than
the freelist, clock sweep scales *considerably* better [1]. As it's rather
rare to be bottlenecked on clock sweep speed for a single thread (rather than
IO or memory copy overhead), I think it's worth favoring clock sweep.

Hey Andres, thanks for spending time on this. I've worked before on
freelist implementations (last one in LMDB) and I think you're onto
something. I think it's an innovative idea and that the speed
difference will either be lost in the noise or potentially entirely
mitigated by avoiding duplicate work.

Agreed. FWIW, just using clock sweep actually makes things like DROP TABLE
perform better because it doesn't need to maintain the freelist anymore...

Also needing to switch between getting buffers from the freelist and the sweep
makes the code more expensive. I think just having the buffer in the sweep,
with a refcount / usagecount of zero would suffice.

If you're not already coding this, I'll jump in. :)

My experimental patch is literally a four character addition ;), namely adding
"0 &&" to the relevant code in StrategyGetBuffer().

Obviously a real patch would need to do some more work than that. Feel free
to take on that project, I am not planning on tackling that in the near term.

There's other things around this that could use some attention. It's not hard
to see clock sweep be a bottleneck in concurrent workloads - partially due to
the shared maintenance of the clock hand. A NUMAed clock sweep would address
that. However, we also maintain StrategyControl->numBufferAllocs, which is a
significant contention point and would not necessarily be removed by a
NUMAification of the clock sweep.
Wouldn't it make sense to partition the numBufferAllocs too, though? I
don't remember if my hacky experimental NUMA-partitioning patch did that
or if I just thought about doing that, but why wouldn't that be enough?
Places that need the "total" count would have to sum the counters, but
it seemed to me most of the places would be fine with the "local" count
for that partition. If we also make sure to "sync" the clocksweeps so as
to not work on just a single partition, that might be enough ...
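
For illustration, partitioned counters could look roughly like this; the
struct, the names and the fixed partition count are hypothetical, not taken
from any of the patches (in reality the counters would live in shared memory
and be initialized with pg_atomic_init_u32):

#include "postgres.h"
#include "port/atomics.h"

#define NUM_SWEEP_PARTITIONS 4	/* fixed here only to keep the sketch short */

typedef struct SweepPartition
{
	pg_atomic_uint32 numBufferAllocs;	/* allocations in this partition */
} SweepPartition;

static SweepPartition sweep_parts[NUM_SWEEP_PARTITIONS];

/* hot path: each backend only touches its own partition's counter */
static inline void
CountBufferAlloc(int partition)
{
	pg_atomic_fetch_add_u32(&sweep_parts[partition].numBufferAllocs, 1);
}

/* cold path: sum the per-partition counters when a total is needed */
static uint32
TotalBufferAllocs(void)
{
	uint32		total = 0;

	for (int i = 0; i < NUM_SWEEP_PARTITIONS; i++)
		total += pg_atomic_read_u32(&sweep_parts[i].numBufferAllocs);

	return total;
}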
regards
--
Tomas Vondra
Hi,
On 2025-07-10 17:31:45 +0200, Tomas Vondra wrote:
On 7/9/25 19:23, Andres Freund wrote:
There's other things around this that could use some attention. It's not hard
to see clock sweep be a bottleneck in concurrent workloads - partially due to
the shared maintenance of the clock hand. A NUMAed clock sweep would address
that. However, we also maintain StrategyControl->numBufferAllocs, which is a
significant contention point and would not necessarily be removed by a
NUMAification of the clock sweep.

Wouldn't it make sense to partition the numBufferAllocs too, though? I
don't remember if my hacky experimental NUMA-partitioning patch did that
or if I just thought about doing that, but why wouldn't that be enough?

It could be solved together with partitioning, yes - that's what I was trying
to reference with the emphasized bit in "would *not necessarily* be removed by
a NUMAification of the clock sweep".
Greetings,
Andres Freund
Hi,
On 2025-07-10 14:17:21 +0000, Bertrand Drouvot wrote:
On Wed, Jul 09, 2025 at 03:42:26PM -0400, Andres Freund wrote:
I wonder if we should *increase* the size of shared_buffers whenever huge
pages are in use and there's padding space due to the huge page
boundaries. Pretty pointless to waste that memory if we can instead use it for
the buffer pool. Not that big a deal with 2MB huge pages, but with 1GB huge
pages...

I think that makes sense, except maybe for operations that need to scan
the whole buffer pool (i.e. related to BUF_DROP_FULL_SCAN_THRESHOLD)?
I don't think the increases here are big enough for that to matter, unless
perhaps you're using 1GB huge pages. But if you're concerned about dropping
tables very fast (i.e. you're running schema change heavy regression tests),
you're not going to use 1GB huge pages.
5) v1-0005-NUMA-interleave-PGPROC-entries.patch
Another area that seems like it might benefit from NUMA is PGPROC, so I
gave it a try. It turned out somewhat challenging. Similarly to buffers
we have two pieces that need to be located in a coordinated way - PGPROC
entries and fast-path arrays. But we can't use the same approach as for
buffers/descriptors, because

(a) Neither of those pieces aligns with memory page size (PGPROC is
~900B, fast-path arrays are variable length).

(b) We could pad PGPROC entries e.g. to 1KB, but that'd still require
rather high max_connections before we use multiple huge pages.

Right now sizeof(PGPROC) happens to be a multiple of 64 (i.e. the most common
cache line size).

Oh right, it's currently 832 bytes and the patch extends that to 840 bytes.
I don't think the patch itself is the problem - it really is just happenstance
that it's a multiple of the line size right now. And it's not on common Armv8
platforms...
With a bit of reordering:
That could be back to 832 (though the order does not make much sense
logically).
I don't think shrinking the size in a one-off way just to keep the
"accidental" size-is-multiple-of-64 property is promising. It'll just get
broken again. I think we should:
a) pad the size of PGPROC to a cache line (or even to a subsequent power of 2,
to make array access cheaper, right now that involves actual
multiplications rather than shifts or indexed `lea` instructions).
That's probably just a pg_attribute_aligned
b) Reorder PGPROC to separate frequently modified from almost-read-only data,
to increase cache hit ratio.
Greetings,
Andres Freund
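
As a minimal sketch of the padding idea in (a) above - assuming
pg_attribute_aligned() is available on the platform, and with PaddedProc and
PROC_SLOT_SIZE being made-up names rather than anything proposed in the
thread:

#include "postgres.h"

#define PROC_SLOT_SIZE 1024		/* assumed next power of two above sizeof(PGPROC) */

typedef struct PaddedProc
{
#ifdef pg_attribute_aligned
	pg_attribute_aligned(PROC_SLOT_SIZE)
#endif
	int			pid;			/* ... the actual PGPROC fields would go here ... */
	int			pgxactoff;
} PaddedProc;

/*
 * The compiler rounds the struct up to its alignment, so indexing an array
 * of these becomes a shift instead of a multiplication.
 */
StaticAssertDecl(sizeof(PaddedProc) == PROC_SLOT_SIZE,
				 "PaddedProc must fill its slot exactly");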
On Jul 10, 2025, at 8:13 AM, Burd, Greg <greg@burd.me> wrote:
On Jul 9, 2025, at 1:23 PM, Andres Freund <andres@anarazel.de> wrote:
Hi,
On 2025-07-09 12:55:51 -0400, Greg Burd wrote:
On Jul 9 2025, at 12:35 pm, Andres Freund <andres@anarazel.de> wrote:
FWIW, I've started to wonder if we shouldn't just get rid of the freelist
entirely. While clocksweep is perhaps minutely slower in a single thread than
the freelist, clock sweep scales *considerably* better [1]. As it's rather
rare to be bottlenecked on clock sweep speed for a single thread (rather than
IO or memory copy overhead), I think it's worth favoring clock sweep.

Hey Andres, thanks for spending time on this. I've worked before on
freelist implementations (last one in LMDB) and I think you're onto
something. I think it's an innovative idea and that the speed
difference will either be lost in the noise or potentially entirely
mitigated by avoiding duplicate work.

Agreed. FWIW, just using clock sweep actually makes things like DROP TABLE
perform better because it doesn't need to maintain the freelist anymore...

Also needing to switch between getting buffers from the freelist and the sweep
makes the code more expensive. I think just having the buffer in the sweep,
with a refcount / usagecount of zero would suffice.

If you're not already coding this, I'll jump in. :)

My experimental patch is literally a four character addition ;), namely adding
"0 &&" to the relevant code in StrategyGetBuffer().

Obviously a real patch would need to do some more work than that. Feel free
to take on that project, I am not planning on tackling that in the near term.

I started on this last night, making good progress. Thanks for the
inspiration. I'll create a new thread to track the work and cross-reference
when I have something reasonable to show (hopefully later today).

There's other things around this that could use some attention. It's not hard
to see clock sweep be a bottleneck in concurrent workloads - partially due to
the shared maintenance of the clock hand. A NUMAed clock sweep would address
that.

Working on it.
For archival sake, and to tie up loose ends, I'll link from here to a new thread I just started that proposes the removal of the freelist and the buffer_strategy_lock [1].
That patch set doesn't address any NUMA-related tasks directly, but it should remove some pain when working in that direction by removing code that requires partitioning and locking and...
best.
-greg
[1]: /messages/by-id/E2D6FCDC-BE98-4F95-B45E-699C3E17BA10@burd.me
Hi,
Here's a v2 of the patch series, with a couple changes:
* I simplified the various freelist partitioning by keeping only the
"node" partitioning (so the cpu/pid strategies are gone). Those were
meant for experimenting, but they made the code more complicated so I
ditched them.
* I changed the freelist partitioning scheme a little bit, based on the
discussion in this thread. Instead of having a single "partition" per
NUMA node, there's now a minimum number of partitions (set to 4). So
even if your system is not NUMA, you'll have 4 of them. If you have 2
nodes, you'll still have 4, and each node will get 2. With 3 nodes we
get 6 partitions (we need 2 per node, and we want to keep the per-node
number equal to keep things simple). Once the number of nodes exceeds 4,
the heuristic switches to one partition per node. (A sketch of this
heuristic follows below.)
I'm aware there's a discussion about maybe simply removing freelists
entirely. If that happens, this becomes mostly irrelevant, of course.
The code should also make sure the freelists "agree" with how the
earlier patch mapped the buffers to NUMA nodes, i.e. the freelist should
only contain buffers from the "correct" NUMA node, etc. I haven't paid
much attention to this - I believe it should work for "nice" values of
shared buffers (when it evenly divides between nodes). But I'm sure it's
possible to confuse that (won't cause crashes, but inefficiency).
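
The partitioning heuristic described above could be expressed roughly like
this; an illustrative reconstruction with made-up names, not the code from
the patch:

#define MIN_FREELIST_PARTITIONS 4

/*
 * At least 4 partitions in total, the same number of partitions on every
 * node, and exactly one partition per node once there are enough nodes.
 */
static int
num_freelist_partitions(int numa_nodes)
{
	int			per_node;

	if (numa_nodes <= 0)
		numa_nodes = 1;			/* a non-NUMA system behaves like one node */

	if (numa_nodes >= MIN_FREELIST_PARTITIONS)
		return numa_nodes;		/* one partition per node */

	/* round up, so every node gets the same number of partitions */
	per_node = (MIN_FREELIST_PARTITIONS + numa_nodes - 1) / numa_nodes;

	return per_node * numa_nodes;
}

That matches the examples above: 1 node gives 4 partitions, 2 nodes give 4,
3 nodes give 6, and with more nodes each node simply gets its own partition.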
* There's now a patch partitioning clocksweep, using the same scheme as
the freelists. I came to the conclusion it doesn't make much sense to
partition these things differently - I can't think of a reason why that
would be advantageous, and it makes it easier to reason about.
The clocksweep partitioning is somewhat harder, because it affects
BgBufferSync() and related code. With the partitioning we now have
multiple "clock hands" for different ranges of buffers, and the clock
sweep needs to consider that. I modified BgBufferSync to simply loop
through the ClockSweep partitions, and do a small cleanup for each.
It does work, as in "it doesn't crash". But this part definitely needs
review to make sure I got the changes to the "pacing" right.
* This new freelist/clocksweep partitioning scheme is however harder to
disable. I now realize the GUC may not quite do the trick, and there isn't
even a GUC for the clocksweep. I need to think about this, but I'm not even
sure how feasible it'd be to have two separate GUCs (because of how these
two pieces are intertwined). For now, if you want to test without the
partitioning, you need to skip the patch.
I did some quick perf testing on my old xeon machine (2 NUMA nodes), and
the results are encouraging. For a read-only pgbench (2x shared buffers,
within RAM), I saw an increase from 1.1M tps to 1.3M. Not crazy, but not
bad considering the patch is more about consistency than raw throughput.
For a read-write pgbench I however saw some strange drops/increases of
throughput. I suspect this might be due to some thinko in the clocksweep
partitioning, but I'll need to take a closer look.
regards
--
Tomas Vondra
Attachments:
v2-0007-NUMA-pin-backends-to-NUMA-nodes.patch (text/x-patch)
From ca651eb85a6656c79fee5aaabc99e4b772b1b8fe Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tomas@vondra.me>
Date: Tue, 27 May 2025 23:08:48 +0200
Subject: [PATCH v2 7/7] NUMA: pin backends to NUMA nodes
When initializing the backend, we pick a PGPROC entry from the right
NUMA node where the backend is running. But the process can move to a
different core / node, so to prevent that we pin it.
---
src/backend/storage/lmgr/proc.c | 21 +++++++++++++++++++++
src/backend/utils/init/globals.c | 1 +
src/backend/utils/misc/guc_tables.c | 10 ++++++++++
src/include/miscadmin.h | 1 +
4 files changed, 33 insertions(+)
diff --git a/src/backend/storage/lmgr/proc.c b/src/backend/storage/lmgr/proc.c
index 9d3e94a7b3a..4c9e55608b2 100644
--- a/src/backend/storage/lmgr/proc.c
+++ b/src/backend/storage/lmgr/proc.c
@@ -729,6 +729,27 @@ InitProcess(void)
}
MyProcNumber = GetNumberFromPGProc(MyProc);
+ /*
+ * Optionally, restrict the process to only run on CPUs from the same NUMA
+ * as the PGPROC. We do this even if the PGPROC has a different NUMA node,
+ * but not for PGPROC entries without a node (i.e. aux/2PC entries).
+ *
+ * This also means we only do this with numa_procs_interleave, because
+ * without that we'll have numa_node=-1 for all PGPROC entries.
+ *
+ * FIXME add proper error-checking for libnuma functions
+ */
+ if (numa_procs_pin && MyProc->numa_node != -1)
+ {
+ struct bitmask *cpumask = numa_allocate_cpumask();
+
+ numa_node_to_cpus(MyProc->numa_node, cpumask);
+
+ numa_sched_setaffinity(MyProcPid, cpumask);
+
+ numa_free_cpumask(cpumask);
+ }
+
/*
* Cross-check that the PGPROC is of the type we expect; if this were not
* the case, it would get returned to the wrong list.
diff --git a/src/backend/utils/init/globals.c b/src/backend/utils/init/globals.c
index 6ee4684d1b8..3f88659b49f 100644
--- a/src/backend/utils/init/globals.c
+++ b/src/backend/utils/init/globals.c
@@ -150,6 +150,7 @@ bool numa_buffers_interleave = false;
bool numa_localalloc = false;
bool numa_partition_freelist = false;
bool numa_procs_interleave = false;
+bool numa_procs_pin = false;
/* GUC parameters for vacuum */
int VacuumBufferUsageLimit = 2048;
diff --git a/src/backend/utils/misc/guc_tables.c b/src/backend/utils/misc/guc_tables.c
index 7b718760248..862341e137e 100644
--- a/src/backend/utils/misc/guc_tables.c
+++ b/src/backend/utils/misc/guc_tables.c
@@ -2156,6 +2156,16 @@ struct config_bool ConfigureNamesBool[] =
NULL, NULL, NULL
},
+ {
+ {"numa_procs_pin", PGC_POSTMASTER, DEVELOPER_OPTIONS,
+ gettext_noop("Enables pinning backends to NUMA nodes (matching the PGPROC node)."),
+ gettext_noop("When enabled, sets affinity to CPUs from the same NUMA node."),
+ },
+ &numa_procs_pin,
+ false,
+ NULL, NULL, NULL
+ },
+
{
{"sync_replication_slots", PGC_SIGHUP, REPLICATION_STANDBY,
gettext_noop("Enables a physical standby to synchronize logical failover replication slots from the primary server."),
diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h
index cdeee8dccba..a97741c6707 100644
--- a/src/include/miscadmin.h
+++ b/src/include/miscadmin.h
@@ -182,6 +182,7 @@ extern PGDLLIMPORT bool numa_buffers_interleave;
extern PGDLLIMPORT bool numa_localalloc;
extern PGDLLIMPORT bool numa_partition_freelist;
extern PGDLLIMPORT bool numa_procs_interleave;
+extern PGDLLIMPORT bool numa_procs_pin;
extern PGDLLIMPORT int commit_timestamp_buffers;
extern PGDLLIMPORT int multixact_member_buffers;
--
2.49.0
v2-0006-NUMA-interleave-PGPROC-entries.patch (text/x-patch)
From 0d79d2fb6ab9f1d5b0b3f03e500315135329b09e Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tomas@vondra.me>
Date: Thu, 22 May 2025 18:39:08 +0200
Subject: [PATCH v2 6/7] NUMA: interleave PGPROC entries
The goal is to distribute ProcArray (or rather PGPROC entries and
associated fast-path arrays) to NUMA nodes.
We can't do this by simply interleaving pages, because that wouldn't
work for both parts at the same time. We want to place the PGPROC and
its fast-path locking structs on the same node, but the structs are
of different sizes, etc.
Another problem is that PGPROC entries are fairly small, so with huge
pages and reasonable values of max_connections everything fits onto a
single page. We don't want to make this incompatible with huge pages.
Note: If we eventually switch to allocating separate shared segments for
different parts (to allow on-line resizing), we could keep using regular
pages for procarray, and this would not be such an issue.
To make this work, we split the PGPROC array into per-node segments,
each with about (MaxBackends / numa_nodes) entries, and one segment for
auxiliary processes and prepared transactions. And we do the same thing
for fast-path arrays.
The PGPROC segments are laid out like this (e.g. for 2 NUMA nodes):
- PGPROC array / node #0
- PGPROC array / node #1
- PGPROC array / aux processes + 2PC transactions
- fast-path arrays / node #0
- fast-path arrays / node #1
- fast-path arrays / aux processes + 2PC transaction
Each segment is aligned to (starts at) a memory page boundary, and is
effectively a multiple of the memory page size.
Having a single PGPROC array made certain operations easier - e.g. it
was possible to iterate the array, and GetNumberFromPGProc() could
calculate offset by simply subtracting PGPROC pointers. With multiple
segments that's not possible, but the fallout is minimal.
Most places accessed PGPROC through PROC_HDR->allProcs, and can continue
to do so, except that now they get a pointer to the PGPROC (which most
places wanted anyway).
Note: There's an indirection, though. But the pointer does not change,
so hopefully that's not an issue. And each PGPROC entry gets an explicit
procnumber field, which is the index in allProcs, GetNumberFromPGProc
can simply return that.
Each PGPROC also gets numa_node, tracking the NUMA node, so that we
don't have to recalculate that. This is used by InitProcess() to pick
a PGPROC entry from the local NUMA node.
Note: The scheduler may migrate the process to a different CPU/node
later. Maybe we should consider pinning the process to the node?
---
src/backend/access/transam/clog.c | 4 +-
src/backend/postmaster/pgarch.c | 2 +-
src/backend/postmaster/walsummarizer.c | 2 +-
src/backend/storage/buffer/freelist.c | 2 +-
src/backend/storage/ipc/procarray.c | 61 ++--
src/backend/storage/lmgr/lock.c | 6 +-
src/backend/storage/lmgr/proc.c | 368 +++++++++++++++++++++++--
src/backend/utils/init/globals.c | 1 +
src/backend/utils/misc/guc_tables.c | 10 +
src/include/miscadmin.h | 1 +
src/include/storage/proc.h | 11 +-
11 files changed, 406 insertions(+), 62 deletions(-)
diff --git a/src/backend/access/transam/clog.c b/src/backend/access/transam/clog.c
index e80fbe109cf..928d126d0ee 100644
--- a/src/backend/access/transam/clog.c
+++ b/src/backend/access/transam/clog.c
@@ -574,7 +574,7 @@ TransactionGroupUpdateXidStatus(TransactionId xid, XidStatus status,
/* Walk the list and update the status of all XIDs. */
while (nextidx != INVALID_PROC_NUMBER)
{
- PGPROC *nextproc = &ProcGlobal->allProcs[nextidx];
+ PGPROC *nextproc = ProcGlobal->allProcs[nextidx];
int64 thispageno = nextproc->clogGroupMemberPage;
/*
@@ -633,7 +633,7 @@ TransactionGroupUpdateXidStatus(TransactionId xid, XidStatus status,
*/
while (wakeidx != INVALID_PROC_NUMBER)
{
- PGPROC *wakeproc = &ProcGlobal->allProcs[wakeidx];
+ PGPROC *wakeproc = ProcGlobal->allProcs[wakeidx];
wakeidx = pg_atomic_read_u32(&wakeproc->clogGroupNext);
pg_atomic_write_u32(&wakeproc->clogGroupNext, INVALID_PROC_NUMBER);
diff --git a/src/backend/postmaster/pgarch.c b/src/backend/postmaster/pgarch.c
index 78e39e5f866..e28e0f7d3bd 100644
--- a/src/backend/postmaster/pgarch.c
+++ b/src/backend/postmaster/pgarch.c
@@ -289,7 +289,7 @@ PgArchWakeup(void)
* be relaunched shortly and will start archiving.
*/
if (arch_pgprocno != INVALID_PROC_NUMBER)
- SetLatch(&ProcGlobal->allProcs[arch_pgprocno].procLatch);
+ SetLatch(&ProcGlobal->allProcs[arch_pgprocno]->procLatch);
}
diff --git a/src/backend/postmaster/walsummarizer.c b/src/backend/postmaster/walsummarizer.c
index 777c9a8d555..087279a6a8e 100644
--- a/src/backend/postmaster/walsummarizer.c
+++ b/src/backend/postmaster/walsummarizer.c
@@ -649,7 +649,7 @@ WakeupWalSummarizer(void)
LWLockRelease(WALSummarizerLock);
if (pgprocno != INVALID_PROC_NUMBER)
- SetLatch(&ProcGlobal->allProcs[pgprocno].procLatch);
+ SetLatch(&ProcGlobal->allProcs[pgprocno]->procLatch);
}
/*
diff --git a/src/backend/storage/buffer/freelist.c b/src/backend/storage/buffer/freelist.c
index 1827e052da7..2ce158ca9bd 100644
--- a/src/backend/storage/buffer/freelist.c
+++ b/src/backend/storage/buffer/freelist.c
@@ -446,7 +446,7 @@ StrategyGetBuffer(BufferAccessStrategy strategy, uint32 *buf_state, bool *from_r
* actually fine because procLatch isn't ever freed, so we just can
* potentially set the wrong process' (or no process') latch.
*/
- SetLatch(&ProcGlobal->allProcs[bgwprocno].procLatch);
+ SetLatch(&ProcGlobal->allProcs[bgwprocno]->procLatch);
}
/*
diff --git a/src/backend/storage/ipc/procarray.c b/src/backend/storage/ipc/procarray.c
index 2418967def6..82158eeb5d6 100644
--- a/src/backend/storage/ipc/procarray.c
+++ b/src/backend/storage/ipc/procarray.c
@@ -268,7 +268,7 @@ typedef enum KAXCompressReason
static ProcArrayStruct *procArray;
-static PGPROC *allProcs;
+static PGPROC **allProcs;
/*
* Cache to reduce overhead of repeated calls to TransactionIdIsInProgress()
@@ -502,7 +502,7 @@ ProcArrayAdd(PGPROC *proc)
int this_procno = arrayP->pgprocnos[index];
Assert(this_procno >= 0 && this_procno < (arrayP->maxProcs + NUM_AUXILIARY_PROCS));
- Assert(allProcs[this_procno].pgxactoff == index);
+ Assert(allProcs[this_procno]->pgxactoff == index);
/* If we have found our right position in the array, break */
if (this_procno > pgprocno)
@@ -538,9 +538,9 @@ ProcArrayAdd(PGPROC *proc)
int procno = arrayP->pgprocnos[index];
Assert(procno >= 0 && procno < (arrayP->maxProcs + NUM_AUXILIARY_PROCS));
- Assert(allProcs[procno].pgxactoff == index - 1);
+ Assert(allProcs[procno]->pgxactoff == index - 1);
- allProcs[procno].pgxactoff = index;
+ allProcs[procno]->pgxactoff = index;
}
/*
@@ -581,7 +581,7 @@ ProcArrayRemove(PGPROC *proc, TransactionId latestXid)
myoff = proc->pgxactoff;
Assert(myoff >= 0 && myoff < arrayP->numProcs);
- Assert(ProcGlobal->allProcs[arrayP->pgprocnos[myoff]].pgxactoff == myoff);
+ Assert(ProcGlobal->allProcs[arrayP->pgprocnos[myoff]]->pgxactoff == myoff);
if (TransactionIdIsValid(latestXid))
{
@@ -636,9 +636,9 @@ ProcArrayRemove(PGPROC *proc, TransactionId latestXid)
int procno = arrayP->pgprocnos[index];
Assert(procno >= 0 && procno < (arrayP->maxProcs + NUM_AUXILIARY_PROCS));
- Assert(allProcs[procno].pgxactoff - 1 == index);
+ Assert(allProcs[procno]->pgxactoff - 1 == index);
- allProcs[procno].pgxactoff = index;
+ allProcs[procno]->pgxactoff = index;
}
/*
@@ -860,7 +860,7 @@ ProcArrayGroupClearXid(PGPROC *proc, TransactionId latestXid)
/* Walk the list and clear all XIDs. */
while (nextidx != INVALID_PROC_NUMBER)
{
- PGPROC *nextproc = &allProcs[nextidx];
+ PGPROC *nextproc = allProcs[nextidx];
ProcArrayEndTransactionInternal(nextproc, nextproc->procArrayGroupMemberXid);
@@ -880,7 +880,7 @@ ProcArrayGroupClearXid(PGPROC *proc, TransactionId latestXid)
*/
while (wakeidx != INVALID_PROC_NUMBER)
{
- PGPROC *nextproc = &allProcs[wakeidx];
+ PGPROC *nextproc = allProcs[wakeidx];
wakeidx = pg_atomic_read_u32(&nextproc->procArrayGroupNext);
pg_atomic_write_u32(&nextproc->procArrayGroupNext, INVALID_PROC_NUMBER);
@@ -1526,7 +1526,7 @@ TransactionIdIsInProgress(TransactionId xid)
pxids = other_subxidstates[pgxactoff].count;
pg_read_barrier(); /* pairs with barrier in GetNewTransactionId() */
pgprocno = arrayP->pgprocnos[pgxactoff];
- proc = &allProcs[pgprocno];
+ proc = allProcs[pgprocno];
for (j = pxids - 1; j >= 0; j--)
{
/* Fetch xid just once - see GetNewTransactionId */
@@ -1622,7 +1622,6 @@ TransactionIdIsInProgress(TransactionId xid)
return false;
}
-
/*
* Determine XID horizons.
*
@@ -1740,7 +1739,7 @@ ComputeXidHorizons(ComputeXidHorizonsResult *h)
for (int index = 0; index < arrayP->numProcs; index++)
{
int pgprocno = arrayP->pgprocnos[index];
- PGPROC *proc = &allProcs[pgprocno];
+ PGPROC *proc = allProcs[pgprocno];
int8 statusFlags = ProcGlobal->statusFlags[index];
TransactionId xid;
TransactionId xmin;
@@ -2224,7 +2223,7 @@ GetSnapshotData(Snapshot snapshot)
TransactionId xid = UINT32_ACCESS_ONCE(other_xids[pgxactoff]);
uint8 statusFlags;
- Assert(allProcs[arrayP->pgprocnos[pgxactoff]].pgxactoff == pgxactoff);
+ Assert(allProcs[arrayP->pgprocnos[pgxactoff]]->pgxactoff == pgxactoff);
/*
* If the transaction has no XID assigned, we can skip it; it
@@ -2298,7 +2297,7 @@ GetSnapshotData(Snapshot snapshot)
if (nsubxids > 0)
{
int pgprocno = pgprocnos[pgxactoff];
- PGPROC *proc = &allProcs[pgprocno];
+ PGPROC *proc = allProcs[pgprocno];
pg_read_barrier(); /* pairs with GetNewTransactionId */
@@ -2499,7 +2498,7 @@ ProcArrayInstallImportedXmin(TransactionId xmin,
for (index = 0; index < arrayP->numProcs; index++)
{
int pgprocno = arrayP->pgprocnos[index];
- PGPROC *proc = &allProcs[pgprocno];
+ PGPROC *proc = allProcs[pgprocno];
int statusFlags = ProcGlobal->statusFlags[index];
TransactionId xid;
@@ -2725,7 +2724,7 @@ GetRunningTransactionData(void)
if (TransactionIdPrecedes(xid, oldestDatabaseRunningXid))
{
int pgprocno = arrayP->pgprocnos[index];
- PGPROC *proc = &allProcs[pgprocno];
+ PGPROC *proc = allProcs[pgprocno];
if (proc->databaseId == MyDatabaseId)
oldestDatabaseRunningXid = xid;
@@ -2756,7 +2755,7 @@ GetRunningTransactionData(void)
for (index = 0; index < arrayP->numProcs; index++)
{
int pgprocno = arrayP->pgprocnos[index];
- PGPROC *proc = &allProcs[pgprocno];
+ PGPROC *proc = allProcs[pgprocno];
int nsubxids;
/*
@@ -3006,7 +3005,7 @@ GetVirtualXIDsDelayingChkpt(int *nvxids, int type)
for (index = 0; index < arrayP->numProcs; index++)
{
int pgprocno = arrayP->pgprocnos[index];
- PGPROC *proc = &allProcs[pgprocno];
+ PGPROC *proc = allProcs[pgprocno];
if ((proc->delayChkptFlags & type) != 0)
{
@@ -3047,7 +3046,7 @@ HaveVirtualXIDsDelayingChkpt(VirtualTransactionId *vxids, int nvxids, int type)
for (index = 0; index < arrayP->numProcs; index++)
{
int pgprocno = arrayP->pgprocnos[index];
- PGPROC *proc = &allProcs[pgprocno];
+ PGPROC *proc = allProcs[pgprocno];
VirtualTransactionId vxid;
GET_VXID_FROM_PGPROC(vxid, *proc);
@@ -3175,7 +3174,7 @@ BackendPidGetProcWithLock(int pid)
for (index = 0; index < arrayP->numProcs; index++)
{
- PGPROC *proc = &allProcs[arrayP->pgprocnos[index]];
+ PGPROC *proc = allProcs[arrayP->pgprocnos[index]];
if (proc->pid == pid)
{
@@ -3218,7 +3217,7 @@ BackendXidGetPid(TransactionId xid)
if (other_xids[index] == xid)
{
int pgprocno = arrayP->pgprocnos[index];
- PGPROC *proc = &allProcs[pgprocno];
+ PGPROC *proc = allProcs[pgprocno];
result = proc->pid;
break;
@@ -3287,7 +3286,7 @@ GetCurrentVirtualXIDs(TransactionId limitXmin, bool excludeXmin0,
for (index = 0; index < arrayP->numProcs; index++)
{
int pgprocno = arrayP->pgprocnos[index];
- PGPROC *proc = &allProcs[pgprocno];
+ PGPROC *proc = allProcs[pgprocno];
uint8 statusFlags = ProcGlobal->statusFlags[index];
if (proc == MyProc)
@@ -3389,7 +3388,7 @@ GetConflictingVirtualXIDs(TransactionId limitXmin, Oid dbOid)
for (index = 0; index < arrayP->numProcs; index++)
{
int pgprocno = arrayP->pgprocnos[index];
- PGPROC *proc = &allProcs[pgprocno];
+ PGPROC *proc = allProcs[pgprocno];
/* Exclude prepared transactions */
if (proc->pid == 0)
@@ -3454,7 +3453,7 @@ SignalVirtualTransaction(VirtualTransactionId vxid, ProcSignalReason sigmode,
for (index = 0; index < arrayP->numProcs; index++)
{
int pgprocno = arrayP->pgprocnos[index];
- PGPROC *proc = &allProcs[pgprocno];
+ PGPROC *proc = allProcs[pgprocno];
VirtualTransactionId procvxid;
GET_VXID_FROM_PGPROC(procvxid, *proc);
@@ -3509,7 +3508,7 @@ MinimumActiveBackends(int min)
for (index = 0; index < arrayP->numProcs; index++)
{
int pgprocno = arrayP->pgprocnos[index];
- PGPROC *proc = &allProcs[pgprocno];
+ PGPROC *proc = allProcs[pgprocno];
/*
* Since we're not holding a lock, need to be prepared to deal with
@@ -3555,7 +3554,7 @@ CountDBBackends(Oid databaseid)
for (index = 0; index < arrayP->numProcs; index++)
{
int pgprocno = arrayP->pgprocnos[index];
- PGPROC *proc = &allProcs[pgprocno];
+ PGPROC *proc = allProcs[pgprocno];
if (proc->pid == 0)
continue; /* do not count prepared xacts */
@@ -3584,7 +3583,7 @@ CountDBConnections(Oid databaseid)
for (index = 0; index < arrayP->numProcs; index++)
{
int pgprocno = arrayP->pgprocnos[index];
- PGPROC *proc = &allProcs[pgprocno];
+ PGPROC *proc = allProcs[pgprocno];
if (proc->pid == 0)
continue; /* do not count prepared xacts */
@@ -3615,7 +3614,7 @@ CancelDBBackends(Oid databaseid, ProcSignalReason sigmode, bool conflictPending)
for (index = 0; index < arrayP->numProcs; index++)
{
int pgprocno = arrayP->pgprocnos[index];
- PGPROC *proc = &allProcs[pgprocno];
+ PGPROC *proc = allProcs[pgprocno];
if (databaseid == InvalidOid || proc->databaseId == databaseid)
{
@@ -3656,7 +3655,7 @@ CountUserBackends(Oid roleid)
for (index = 0; index < arrayP->numProcs; index++)
{
int pgprocno = arrayP->pgprocnos[index];
- PGPROC *proc = &allProcs[pgprocno];
+ PGPROC *proc = allProcs[pgprocno];
if (proc->pid == 0)
continue; /* do not count prepared xacts */
@@ -3719,7 +3718,7 @@ CountOtherDBBackends(Oid databaseId, int *nbackends, int *nprepared)
for (index = 0; index < arrayP->numProcs; index++)
{
int pgprocno = arrayP->pgprocnos[index];
- PGPROC *proc = &allProcs[pgprocno];
+ PGPROC *proc = allProcs[pgprocno];
uint8 statusFlags = ProcGlobal->statusFlags[index];
if (proc->databaseId != databaseId)
@@ -3785,7 +3784,7 @@ TerminateOtherDBBackends(Oid databaseId)
for (i = 0; i < procArray->numProcs; i++)
{
int pgprocno = arrayP->pgprocnos[i];
- PGPROC *proc = &allProcs[pgprocno];
+ PGPROC *proc = allProcs[pgprocno];
if (proc->databaseId != databaseId)
continue;
diff --git a/src/backend/storage/lmgr/lock.c b/src/backend/storage/lmgr/lock.c
index 62f3471448e..c84a2a5f1bc 100644
--- a/src/backend/storage/lmgr/lock.c
+++ b/src/backend/storage/lmgr/lock.c
@@ -2844,7 +2844,7 @@ FastPathTransferRelationLocks(LockMethod lockMethodTable, const LOCKTAG *locktag
*/
for (i = 0; i < ProcGlobal->allProcCount; i++)
{
- PGPROC *proc = &ProcGlobal->allProcs[i];
+ PGPROC *proc = ProcGlobal->allProcs[i];
uint32 j;
LWLockAcquire(&proc->fpInfoLock, LW_EXCLUSIVE);
@@ -3103,7 +3103,7 @@ GetLockConflicts(const LOCKTAG *locktag, LOCKMODE lockmode, int *countp)
*/
for (i = 0; i < ProcGlobal->allProcCount; i++)
{
- PGPROC *proc = &ProcGlobal->allProcs[i];
+ PGPROC *proc = ProcGlobal->allProcs[i];
uint32 j;
/* A backend never blocks itself */
@@ -3790,7 +3790,7 @@ GetLockStatusData(void)
*/
for (i = 0; i < ProcGlobal->allProcCount; ++i)
{
- PGPROC *proc = &ProcGlobal->allProcs[i];
+ PGPROC *proc = ProcGlobal->allProcs[i];
/* Skip backends with pid=0, as they don't hold fast-path locks */
if (proc->pid == 0)
diff --git a/src/backend/storage/lmgr/proc.c b/src/backend/storage/lmgr/proc.c
index e9ef0fbfe32..9d3e94a7b3a 100644
--- a/src/backend/storage/lmgr/proc.c
+++ b/src/backend/storage/lmgr/proc.c
@@ -29,21 +29,29 @@
*/
#include "postgres.h"
+#include <sched.h>
#include <signal.h>
#include <unistd.h>
#include <sys/time.h>
+#ifdef USE_LIBNUMA
+#include <numa.h>
+#include <numaif.h>
+#endif
+
#include "access/transam.h"
#include "access/twophase.h"
#include "access/xlogutils.h"
#include "miscadmin.h"
#include "pgstat.h"
+#include "port/pg_numa.h"
#include "postmaster/autovacuum.h"
#include "replication/slotsync.h"
#include "replication/syncrep.h"
#include "storage/condition_variable.h"
#include "storage/ipc.h"
#include "storage/lmgr.h"
+#include "storage/pg_shmem.h"
#include "storage/pmsignal.h"
#include "storage/proc.h"
#include "storage/procarray.h"
@@ -89,6 +97,12 @@ static void ProcKill(int code, Datum arg);
static void AuxiliaryProcKill(int code, Datum arg);
static void CheckDeadLock(void);
+/* NUMA */
+static Size get_memory_page_size(void); /* XXX duplicate */
+static void move_to_node(char *startptr, char *endptr,
+ Size mem_page_size, int node);
+static int numa_nodes = -1;
+
/*
* Report shared-memory space needed by PGPROC.
@@ -100,11 +114,40 @@ PGProcShmemSize(void)
Size TotalProcs =
add_size(MaxBackends, add_size(NUM_AUXILIARY_PROCS, max_prepared_xacts));
+ size = add_size(size, mul_size(TotalProcs, sizeof(PGPROC *)));
size = add_size(size, mul_size(TotalProcs, sizeof(PGPROC)));
size = add_size(size, mul_size(TotalProcs, sizeof(*ProcGlobal->xids)));
size = add_size(size, mul_size(TotalProcs, sizeof(*ProcGlobal->subxidStates)));
size = add_size(size, mul_size(TotalProcs, sizeof(*ProcGlobal->statusFlags)));
+ /*
+ * With NUMA, we allocate the PGPROC array in several chunks. With shared
+ * buffers we simply manually assign parts of the buffer array to
+ * different NUMA nodes, and that does the trick. But we can't do that for
+ * PGPROC, as the number of PGPROC entries is much lower, especially with
+ * huge pages. We can fit ~2k entries on a 2MB page, and NUMA does stuff
+ * with page granularity, and the large NUMA systems are likely to use
+ * huge pages. So with sensible max_connections we would not use more than
+ * a single page, which means it gets to a single NUMA node.
+ *
+ * So we allocate PGPROC not as a single array, but one array per NUMA
+ * node, and then one array for aux processes (without NUMA node
+ * assigned). Each array may need up to memory-page-worth of padding,
+ * worst case. So we just add that - it's a bit wasteful, but good enough
+ * for PoC.
+ *
+ * FIXME Should be conditional, but that was causing problems in bootstrap
+ * mode. Or maybe it was because the code that allocates stuff later does
+ * not do that conditionally. Anyway, needs to be fixed.
+ */
+ /* if (numa_procs_interleave) */
+ {
+ int num_nodes = numa_num_configured_nodes();
+ Size mem_page_size = get_memory_page_size();
+
+ size = add_size(size, mul_size((num_nodes + 1), mem_page_size));
+ }
+
return size;
}
@@ -129,6 +172,26 @@ FastPathLockShmemSize(void)
size = add_size(size, mul_size(TotalProcs, (fpLockBitsSize + fpRelIdSize)));
+ /*
+ * Same NUMA-padding logic as in PGProcShmemSize, adding a memory page per
+ * NUMA node - but this way we add two pages per node - one for PGPROC,
+ * one for fast-path arrays. In theory we could make this work just one
+ * page per node, by adding fast-path arrays right after PGPROC entries on
+ * each node. But now we allocate fast-path locks separately - good enough
+ * for PoC.
+ *
+ * FIXME Should be conditional, but that was causing problems in bootstrap
+ * mode. Or maybe it was because the code that allocates stuff later does
+ * not do that conditionally. Anyway, needs to be fixed.
+ */
+ /* if (numa_procs_interleave) */
+ {
+ int num_nodes = numa_num_configured_nodes();
+ Size mem_page_size = get_memory_page_size();
+
+ size = add_size(size, mul_size((num_nodes + 1), mem_page_size));
+ }
+
return size;
}
@@ -191,11 +254,13 @@ ProcGlobalSemas(void)
void
InitProcGlobal(void)
{
- PGPROC *procs;
+ PGPROC **procs;
int i,
j;
bool found;
uint32 TotalProcs = MaxBackends + NUM_AUXILIARY_PROCS + max_prepared_xacts;
+ int procs_total;
+ int procs_per_node;
/* Used for setup of per-backend fast-path slots. */
char *fpPtr,
@@ -205,6 +270,8 @@ InitProcGlobal(void)
Size requestSize;
char *ptr;
+ Size mem_page_size = get_memory_page_size();
+
/* Create the ProcGlobal shared structure */
ProcGlobal = (PROC_HDR *)
ShmemInitStruct("Proc Header", sizeof(PROC_HDR), &found);
@@ -224,6 +291,9 @@ InitProcGlobal(void)
pg_atomic_init_u32(&ProcGlobal->procArrayGroupFirst, INVALID_PROC_NUMBER);
pg_atomic_init_u32(&ProcGlobal->clogGroupFirst, INVALID_PROC_NUMBER);
+ /* one chunk per NUMA node (without NUMA assume 1 node) */
+ numa_nodes = numa_num_configured_nodes();
+
/*
* Create and initialize all the PGPROC structures we'll need. There are
* six separate consumers: (1) normal backends, (2) autovacuum workers and
@@ -241,19 +311,108 @@ InitProcGlobal(void)
MemSet(ptr, 0, requestSize);
- procs = (PGPROC *) ptr;
- ptr = (char *) ptr + TotalProcs * sizeof(PGPROC);
+ /* allprocs (array of pointers to PGPROC entries) */
+ procs = (PGPROC **) ptr;
+ ptr = (char *) ptr + TotalProcs * sizeof(PGPROC *);
ProcGlobal->allProcs = procs;
/* XXX allProcCount isn't really all of them; it excludes prepared xacts */
ProcGlobal->allProcCount = MaxBackends + NUM_AUXILIARY_PROCS;
+ /*
+ * NUMA partitioning
+ *
+ * Now build the actual PGPROC arrays, one "chunk" per NUMA node (and one
+ * extra for auxiliary processes and 2PC transactions, not associated with
+ * any particular node).
+ *
+ * First determine how many "backend" procs to allocate per NUMA node. The
+ * count may not be exactly divisible, but we mostly ignore that. The last
+ * node may get somewhat fewer PGPROC entries, but the imbalance ought to
+ * be pretty small (if MaxBackends >> numa_nodes).
+ *
+ * XXX A fairer distribution is possible, but not worth it now.
+ */
+ procs_per_node = (MaxBackends + (numa_nodes - 1)) / numa_nodes;
+ procs_total = 0;
+
+ /* build PGPROC entries for NUMA nodes */
+ for (i = 0; i < numa_nodes; i++)
+ {
+ PGPROC *procs_node;
+
+ /* the last NUMA node may get fewer PGPROC entries, but meh */
+ int count_node = Min(procs_per_node, MaxBackends - procs_total);
+
+ /* make sure to align the PGPROC array to memory page */
+ ptr = (char *) TYPEALIGN(mem_page_size, ptr);
+
+ /* allocate the PGPROC chunk for this node */
+ procs_node = (PGPROC *) ptr;
+ ptr = (char *) ptr + count_node * sizeof(PGPROC);
+
+ /* don't overflow the allocation */
+ Assert((ptr > (char *) procs) && (ptr <= (char *) procs + requestSize));
+
+ /* add pointers to the PGPROC entries to allProcs */
+ for (j = 0; j < count_node; j++)
+ {
+ procs_node[j].numa_node = i;
+ procs_node[j].procnumber = procs_total;
+
+ ProcGlobal->allProcs[procs_total++] = &procs_node[j];
+ }
+
+ move_to_node((char *) procs_node, ptr, mem_page_size, i);
+ }
+
+ /*
+ * also build PGPROC entries for auxiliary procs / prepared xacts (we
+ * don't assign those to any NUMA node)
+ *
+ * XXX Mostly duplicate of preceding block, could be reused.
+ */
+ {
+ PGPROC *procs_node;
+ int count_node = (NUM_AUXILIARY_PROCS + max_prepared_xacts);
+
+ /*
+ * Make sure to align PGPROC array to memory page (it may not be
+ * aligned). We won't assign this to any NUMA node, but we still don't
+ * want it to interfere with the preceding chunk (for the last NUMA
+ * node).
+ */
+ ptr = (char *) TYPEALIGN(mem_page_size, ptr);
+
+ procs_node = (PGPROC *) ptr;
+ ptr = (char *) ptr + count_node * sizeof(PGPROC);
+
+ /* don't overflow the allocation */
+ Assert((ptr > (char *) procs) && (ptr <= (char *) procs + requestSize));
+
+ /* now add the PGPROC pointers to allProcs */
+ for (j = 0; j < count_node; j++)
+ {
+ procs_node[j].numa_node = -1;
+ procs_node[j].procnumber = procs_total;
+
+ ProcGlobal->allProcs[procs_total++] = &procs_node[j];
+ }
+ }
+
+ /* we should have allocated the expected number of PGPROC entries */
+ Assert(procs_total == TotalProcs);
+
/*
* Allocate arrays mirroring PGPROC fields in a dense manner. See
* PROC_HDR.
*
* XXX: It might make sense to increase padding for these arrays, given
* how hotly they are accessed.
+ *
+ * XXX Would it make sense to NUMA-partition these chunks too, somehow?
+ * But those arrays are tiny, fit into a single memory page, so would need
+ * to be made more complex. Not sure.
*/
ProcGlobal->xids = (TransactionId *) ptr;
ptr = (char *) ptr + (TotalProcs * sizeof(*ProcGlobal->xids));
@@ -286,23 +445,100 @@ InitProcGlobal(void)
/* For asserts checking we did not overflow. */
fpEndPtr = fpPtr + requestSize;
- for (i = 0; i < TotalProcs; i++)
+ /* reset the count */
+ procs_total = 0;
+
+ /*
+ * Mimic the same logic as above, but for fast-path locking.
+ */
+ for (i = 0; i < numa_nodes; i++)
{
- PGPROC *proc = &procs[i];
+ char *startptr;
+ char *endptr;
- /* Common initialization for all PGPROCs, regardless of type. */
+ /* the last NUMA node may get fewer PGPROC entries, but meh */
+ int procs_node = Min(procs_per_node, MaxBackends - procs_total);
+
+ /* align to memory page, to make move_pages possible */
+ fpPtr = (char *) TYPEALIGN(mem_page_size, fpPtr);
+
+ startptr = fpPtr;
+ endptr = fpPtr + procs_node * (fpLockBitsSize + fpRelIdSize);
+
+ move_to_node(startptr, endptr, mem_page_size, i);
/*
- * Set the fast-path lock arrays, and move the pointer. We interleave
- * the two arrays, to (hopefully) get some locality for each backend.
+ * Now point the PGPROC entries to the fast-path arrays, and also
+ * advance the fpPtr.
*/
- proc->fpLockBits = (uint64 *) fpPtr;
- fpPtr += fpLockBitsSize;
+ for (j = 0; j < procs_node; j++)
+ {
+ PGPROC *proc = ProcGlobal->allProcs[procs_total++];
+
+ /* cross-check we got the expected NUMA node */
+ Assert(proc->numa_node == i);
+ Assert(proc->procnumber == (procs_total - 1));
+
+ /*
+ * Set the fast-path lock arrays, and move the pointer. We
+ * interleave the two arrays, to (hopefully) get some locality for
+ * each backend.
+ */
+ proc->fpLockBits = (uint64 *) fpPtr;
+ fpPtr += fpLockBitsSize;
- proc->fpRelId = (Oid *) fpPtr;
- fpPtr += fpRelIdSize;
+ proc->fpRelId = (Oid *) fpPtr;
+ fpPtr += fpRelIdSize;
- Assert(fpPtr <= fpEndPtr);
+ Assert(fpPtr <= fpEndPtr);
+ }
+
+ Assert(fpPtr == endptr);
+ }
+
+ /* auxiliary processes / prepared xacts */
+ {
+ /* the last NUMA node may get fewer PGPROC entries, but meh */
+ int procs_node = (NUM_AUXILIARY_PROCS + max_prepared_xacts);
+
+ /* align to memory page, to make move_pages possible */
+ fpPtr = (char *) TYPEALIGN(mem_page_size, fpPtr);
+
+ /* now point the PGPROC entries to the fast-path arrays */
+ for (j = 0; j < procs_node; j++)
+ {
+ PGPROC *proc = ProcGlobal->allProcs[procs_total++];
+
+ /* cross-check we got PGPROC with no NUMA node assigned */
+ Assert(proc->numa_node == -1);
+ Assert(proc->procnumber == (procs_total - 1));
+
+ /*
+ * Set the fast-path lock arrays, and move the pointer. We
+ * interleave the two arrays, to (hopefully) get some locality for
+ * each backend.
+ */
+ proc->fpLockBits = (uint64 *) fpPtr;
+ fpPtr += fpLockBitsSize;
+
+ proc->fpRelId = (Oid *) fpPtr;
+ fpPtr += fpRelIdSize;
+
+ Assert(fpPtr <= fpEndPtr);
+ }
+ }
+
+ /* Should have consumed exactly the expected amount of fast-path memory. */
+ Assert(fpPtr <= fpEndPtr);
+
+ /* make sure we allocated the expected number of PGPROC entries */
+ Assert(procs_total == TotalProcs);
+
+ for (i = 0; i < TotalProcs; i++)
+ {
+ PGPROC *proc = procs[i];
+
+ Assert(proc->procnumber == i);
/*
* Set up per-PGPROC semaphore, latch, and fpInfoLock. Prepared xact
@@ -366,15 +602,12 @@ InitProcGlobal(void)
pg_atomic_init_u64(&(proc->waitStart), 0);
}
- /* Should have consumed exactly the expected amount of fast-path memory. */
- Assert(fpPtr == fpEndPtr);
-
/*
* Save pointers to the blocks of PGPROC structures reserved for auxiliary
* processes and prepared transactions.
*/
- AuxiliaryProcs = &procs[MaxBackends];
- PreparedXactProcs = &procs[MaxBackends + NUM_AUXILIARY_PROCS];
+ AuxiliaryProcs = procs[MaxBackends];
+ PreparedXactProcs = procs[MaxBackends + NUM_AUXILIARY_PROCS];
/* Create ProcStructLock spinlock, too */
ProcStructLock = (slock_t *) ShmemInitStruct("ProcStructLock spinlock",
@@ -435,7 +668,45 @@ InitProcess(void)
if (!dlist_is_empty(procgloballist))
{
- MyProc = dlist_container(PGPROC, links, dlist_pop_head_node(procgloballist));
+ /*
+ * With numa interleaving of PGPROC, try to get a PROC entry from the
+ * right NUMA node (when the process starts).
+ *
+ * XXX The process may move to a different NUMA node later, but
+ * there's not much we can do about that.
+ */
+ if (numa_procs_interleave)
+ {
+ dlist_mutable_iter iter;
+ unsigned cpu;
+ unsigned node;
+ int rc;
+
+ rc = getcpu(&cpu, &node);
+ if (rc != 0)
+ elog(ERROR, "getcpu failed: %m");
+
+ MyProc = NULL;
+
+ dlist_foreach_modify(iter, procgloballist)
+ {
+ PGPROC *proc;
+
+ proc = dlist_container(PGPROC, links, iter.cur);
+
+ if (proc->numa_node == node)
+ {
+ MyProc = proc;
+ dlist_delete(iter.cur);
+ break;
+ }
+ }
+ }
+
+ /* didn't find PGPROC from the correct NUMA node, pick any free one */
+ if (MyProc == NULL)
+ MyProc = dlist_container(PGPROC, links, dlist_pop_head_node(procgloballist));
+
SpinLockRelease(ProcStructLock);
}
else
@@ -1988,7 +2259,7 @@ ProcSendSignal(ProcNumber procNumber)
if (procNumber < 0 || procNumber >= ProcGlobal->allProcCount)
elog(ERROR, "procNumber out of range");
- SetLatch(&ProcGlobal->allProcs[procNumber].procLatch);
+ SetLatch(&ProcGlobal->allProcs[procNumber]->procLatch);
}
/*
@@ -2063,3 +2334,60 @@ BecomeLockGroupMember(PGPROC *leader, int pid)
return ok;
}
+
+/* copy from buf_init.c */
+static Size
+get_memory_page_size(void)
+{
+ Size os_page_size;
+ Size huge_page_size;
+
+#ifdef WIN32
+ SYSTEM_INFO sysinfo;
+
+ GetSystemInfo(&sysinfo);
+ os_page_size = sysinfo.dwPageSize;
+#else
+ os_page_size = sysconf(_SC_PAGESIZE);
+#endif
+
+ /*
+ * XXX This is a bit annoying/confusing, because we may get a different
+ * result depending on when we call it. Before mmap() we don't know if the
+ * huge pages get used, so we assume they will. And then if we don't get
+ * huge pages, we'll waste memory etc.
+ */
+
+ /* assume huge pages get used, unless HUGE_PAGES_OFF */
+ if (huge_pages_status == HUGE_PAGES_OFF)
+ huge_page_size = 0;
+ else
+ GetHugePageSize(&huge_page_size, NULL);
+
+ return Max(os_page_size, huge_page_size);
+}
+
+/*
+ * move_to_node
+ * move all pages in the given range to the requested NUMA node
+ *
+ * XXX This is expected to only process fairly small number of pages, so no
+ * need to do batching etc. Just move pages one by one.
+ */
+static void
+move_to_node(char *startptr, char *endptr, Size mem_page_size, int node)
+{
+ while (startptr < endptr)
+ {
+ int r,
+ status;
+
+ r = numa_move_pages(0, 1, (void **) &startptr, &node, &status, 0);
+
+ if (r != 0)
+ elog(WARNING, "failed to move page to NUMA node %d (r = %d, status = %d)",
+ node, r, status);
+
+ startptr += mem_page_size;
+ }
+}
diff --git a/src/backend/utils/init/globals.c b/src/backend/utils/init/globals.c
index a11bc71a386..6ee4684d1b8 100644
--- a/src/backend/utils/init/globals.c
+++ b/src/backend/utils/init/globals.c
@@ -149,6 +149,7 @@ int MaxBackends = 0;
bool numa_buffers_interleave = false;
bool numa_localalloc = false;
bool numa_partition_freelist = false;
+bool numa_procs_interleave = false;
/* GUC parameters for vacuum */
int VacuumBufferUsageLimit = 2048;
diff --git a/src/backend/utils/misc/guc_tables.c b/src/backend/utils/misc/guc_tables.c
index 0552ed62cc7..7b718760248 100644
--- a/src/backend/utils/misc/guc_tables.c
+++ b/src/backend/utils/misc/guc_tables.c
@@ -2146,6 +2146,16 @@ struct config_bool ConfigureNamesBool[] =
NULL, NULL, NULL
},
+ {
+ {"numa_procs_interleave", PGC_POSTMASTER, DEVELOPER_OPTIONS,
+ gettext_noop("Enables NUMA interleaving of PGPROC entries."),
+ gettext_noop("When enabled, the PGPROC entries are interleaved to all NUMA nodes."),
+ },
+ &numa_procs_interleave,
+ false,
+ NULL, NULL, NULL
+ },
+
{
{"sync_replication_slots", PGC_SIGHUP, REPLICATION_STANDBY,
gettext_noop("Enables a physical standby to synchronize logical failover replication slots from the primary server."),
diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h
index 66baf2bf33e..cdeee8dccba 100644
--- a/src/include/miscadmin.h
+++ b/src/include/miscadmin.h
@@ -181,6 +181,7 @@ extern PGDLLIMPORT int max_parallel_workers;
extern PGDLLIMPORT bool numa_buffers_interleave;
extern PGDLLIMPORT bool numa_localalloc;
extern PGDLLIMPORT bool numa_partition_freelist;
+extern PGDLLIMPORT bool numa_procs_interleave;
extern PGDLLIMPORT int commit_timestamp_buffers;
extern PGDLLIMPORT int multixact_member_buffers;
diff --git a/src/include/storage/proc.h b/src/include/storage/proc.h
index 9f9b3fcfbf1..5cb1632718e 100644
--- a/src/include/storage/proc.h
+++ b/src/include/storage/proc.h
@@ -194,6 +194,8 @@ struct PGPROC
* vacuum must not remove tuples deleted by
* xid >= xmin ! */
+ int procnumber; /* index in ProcGlobal->allProcs */
+
int pid; /* Backend's process ID; 0 if prepared xact */
int pgxactoff; /* offset into various ProcGlobal->arrays with
@@ -319,6 +321,9 @@ struct PGPROC
PGPROC *lockGroupLeader; /* lock group leader, if I'm a member */
dlist_head lockGroupMembers; /* list of members, if I'm a leader */
dlist_node lockGroupLink; /* my member link, if I'm a member */
+
+ /* NUMA node */
+ int numa_node;
};
/* NOTE: "typedef struct PGPROC PGPROC" appears in storage/lock.h. */
@@ -383,7 +388,7 @@ extern PGDLLIMPORT PGPROC *MyProc;
typedef struct PROC_HDR
{
/* Array of PGPROC structures (not including dummies for prepared txns) */
- PGPROC *allProcs;
+ PGPROC **allProcs;
/* Array mirroring PGPROC.xid for each PGPROC currently in the procarray */
TransactionId *xids;
@@ -435,8 +440,8 @@ extern PGDLLIMPORT PGPROC *PreparedXactProcs;
/*
* Accessors for getting PGPROC given a ProcNumber and vice versa.
*/
-#define GetPGProcByNumber(n) (&ProcGlobal->allProcs[(n)])
-#define GetNumberFromPGProc(proc) ((proc) - &ProcGlobal->allProcs[0])
+#define GetPGProcByNumber(n) (ProcGlobal->allProcs[(n)])
+#define GetNumberFromPGProc(proc) ((proc)->procnumber)
/*
* We set aside some extra PGPROC structures for "special worker" processes,
--
2.49.0
v2-0005-NUMA-clockweep-partitioning.patch (text/x-patch)
From c4d51ab87b92f9900e37d42cf74980e87b648a56 Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tomas@vondra.me>
Date: Sun, 8 Jun 2025 18:53:12 +0200
Subject: [PATCH v2 5/7] NUMA: clockweep partitioning
---
src/backend/storage/buffer/bufmgr.c | 473 ++++++++++++++------------
src/backend/storage/buffer/freelist.c | 202 ++++++++---
src/include/storage/buf_internals.h | 4 +-
3 files changed, 424 insertions(+), 255 deletions(-)
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 5922689fe5d..3d6c834d77c 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -3587,6 +3587,23 @@ BufferSync(int flags)
TRACE_POSTGRESQL_BUFFER_SYNC_DONE(NBuffers, num_written, num_to_scan);
}
+/*
+ * Information saved between calls so we can determine the strategy
+ * point's advance rate and avoid scanning already-cleaned buffers.
+ *
+ * XXX One value per partition. We don't know how many partitions are
+ * there, so allocate 32, should be enough for the PoC patch.
+ *
+ * XXX might be better to have a per-partition struct with all the info
+ */
+#define MAX_CLOCKSWEEP_PARTITIONS 32
+static bool saved_info_valid = false;
+static int prev_strategy_buf_id[MAX_CLOCKSWEEP_PARTITIONS];
+static uint32 prev_strategy_passes[MAX_CLOCKSWEEP_PARTITIONS];
+static int next_to_clean[MAX_CLOCKSWEEP_PARTITIONS];
+static uint32 next_passes[MAX_CLOCKSWEEP_PARTITIONS];
+
+
/*
* BgBufferSync -- Write out some dirty buffers in the pool.
*
@@ -3602,55 +3619,24 @@ bool
BgBufferSync(WritebackContext *wb_context)
{
/* info obtained from freelist.c */
- int strategy_buf_id;
- uint32 strategy_passes;
uint32 recent_alloc;
+ uint32 recent_alloc_partition;
+ int num_partitions;
- /*
- * Information saved between calls so we can determine the strategy
- * point's advance rate and avoid scanning already-cleaned buffers.
- */
- static bool saved_info_valid = false;
- static int prev_strategy_buf_id;
- static uint32 prev_strategy_passes;
- static int next_to_clean;
- static uint32 next_passes;
-
- /* Moving averages of allocation rate and clean-buffer density */
- static float smoothed_alloc = 0;
- static float smoothed_density = 10.0;
-
- /* Potentially these could be tunables, but for now, not */
- float smoothing_samples = 16;
- float scan_whole_pool_milliseconds = 120000.0;
-
- /* Used to compute how far we scan ahead */
- long strategy_delta;
- int bufs_to_lap;
- int bufs_ahead;
- float scans_per_alloc;
- int reusable_buffers_est;
- int upcoming_alloc_est;
- int min_scan_buffers;
-
- /* Variables for the scanning loop proper */
- int num_to_scan;
- int num_written;
- int reusable_buffers;
+ /* assume we can hibernate, any partition can set to false */
+ bool hibernate = true;
- /* Variables for final smoothed_density update */
- long new_strategy_delta;
- uint32 new_recent_alloc;
+ /* get the number of clocksweep partitions, and total alloc count */
+ StrategySyncPrepare(&num_partitions, &recent_alloc);
- /*
- * Find out where the freelist clock sweep currently is, and how many
- * buffer allocations have happened since our last call.
- */
- strategy_buf_id = StrategySyncStart(&strategy_passes, &recent_alloc);
+ Assert(num_partitions <= MAX_CLOCKSWEEP_PARTITIONS);
/* Report buffer alloc counts to pgstat */
PendingBgWriterStats.buf_alloc += recent_alloc;
+ /* average alloc buffers per partition */
+ recent_alloc_partition = (recent_alloc / num_partitions);
+
/*
* If we're not running the LRU scan, just stop after doing the stats
* stuff. We mark the saved state invalid so that we can recover sanely
@@ -3663,223 +3649,282 @@ BgBufferSync(WritebackContext *wb_context)
}
/*
- * Compute strategy_delta = how many buffers have been scanned by the
- * clock sweep since last time. If first time through, assume none. Then
- * see if we are still ahead of the clock sweep, and if so, how many
- * buffers we could scan before we'd catch up with it and "lap" it. Note:
- * weird-looking coding of xxx_passes comparisons are to avoid bogus
- * behavior when the passes counts wrap around.
- */
- if (saved_info_valid)
- {
- int32 passes_delta = strategy_passes - prev_strategy_passes;
-
- strategy_delta = strategy_buf_id - prev_strategy_buf_id;
- strategy_delta += (long) passes_delta * NBuffers;
+ * now process the clocksweep partitions, one by one, using the same
+ * cleanup that we used for all buffers
+ *
+ * XXX Maybe we should randomize the order of partitions a bit, so that
+ * we don't start from partition 0 all the time? Perhaps not entirely,
+ * but at least pick a random starting point?
+ */
+ for (int partition = 0; partition < num_partitions; partition++)
+ {
+ /* info obtained from freelist.c */
+ int strategy_buf_id;
+ uint32 strategy_passes;
+
+ /* Moving averages of allocation rate and clean-buffer density */
+ static float smoothed_alloc = 0;
+ static float smoothed_density = 10.0;
+
+ /* Potentially these could be tunables, but for now, not */
+ float smoothing_samples = 16;
+ float scan_whole_pool_milliseconds = 120000.0;
+
+ /* Used to compute how far we scan ahead */
+ long strategy_delta;
+ int bufs_to_lap;
+ int bufs_ahead;
+ float scans_per_alloc;
+ int reusable_buffers_est;
+ int upcoming_alloc_est;
+ int min_scan_buffers;
+
+ /* Variables for the scanning loop proper */
+ int num_to_scan;
+ int num_written;
+ int reusable_buffers;
+
+ /* Variables for final smoothed_density update */
+ long new_strategy_delta;
+ uint32 new_recent_alloc;
+
+ /* buffer range for the clocksweep partition */
+ int first_buffer;
+ int num_buffers;
- Assert(strategy_delta >= 0);
+ /*
+ * Find out where the freelist clock sweep currently is, and how many
+ * buffer allocations have happened since our last call.
+ */
+ strategy_buf_id = StrategySyncStart(partition, &strategy_passes,
+ &first_buffer, &num_buffers);
- if ((int32) (next_passes - strategy_passes) > 0)
+ /*
+ * Compute strategy_delta = how many buffers have been scanned by the
+ * clock sweep since last time. If first time through, assume none. Then
+ * see if we are still ahead of the clock sweep, and if so, how many
+ * buffers we could scan before we'd catch up with it and "lap" it. Note:
+ * weird-looking coding of xxx_passes comparisons are to avoid bogus
+ * behavior when the passes counts wrap around.
+ */
+ if (saved_info_valid)
{
- /* we're one pass ahead of the strategy point */
- bufs_to_lap = strategy_buf_id - next_to_clean;
+ int32 passes_delta = strategy_passes - prev_strategy_passes[partition];
+
+ strategy_delta = strategy_buf_id - prev_strategy_buf_id[partition];
+ strategy_delta += (long) passes_delta * num_buffers;
+
+ Assert(strategy_delta >= 0);
+
+ if ((int32) (next_passes[partition] - strategy_passes) > 0)
+ {
+ /* we're one pass ahead of the strategy point */
+ bufs_to_lap = strategy_buf_id - next_to_clean[partition];
#ifdef BGW_DEBUG
- elog(DEBUG2, "bgwriter ahead: bgw %u-%u strategy %u-%u delta=%ld lap=%d",
- next_passes, next_to_clean,
- strategy_passes, strategy_buf_id,
- strategy_delta, bufs_to_lap);
+ elog(DEBUG2, "bgwriter ahead: bgw %u-%u strategy %u-%u delta=%ld lap=%d",
+				 next_passes[partition], next_to_clean[partition],
+ strategy_passes, strategy_buf_id,
+ strategy_delta, bufs_to_lap);
#endif
- }
- else if (next_passes == strategy_passes &&
- next_to_clean >= strategy_buf_id)
- {
- /* on same pass, but ahead or at least not behind */
- bufs_to_lap = NBuffers - (next_to_clean - strategy_buf_id);
+ }
+ else if (next_passes[partition] == strategy_passes &&
+ next_to_clean[partition] >= strategy_buf_id)
+ {
+ /* on same pass, but ahead or at least not behind */
+ bufs_to_lap = num_buffers - (next_to_clean[partition] - strategy_buf_id);
+#ifdef BGW_DEBUG
+ elog(DEBUG2, "bgwriter ahead: bgw %u-%u strategy %u-%u delta=%ld lap=%d",
+				 next_passes[partition], next_to_clean[partition],
+ strategy_passes, strategy_buf_id,
+ strategy_delta, bufs_to_lap);
+#endif
+ }
+ else
+ {
+ /*
+ * We're behind, so skip forward to the strategy point and start
+ * cleaning from there.
+ */
#ifdef BGW_DEBUG
- elog(DEBUG2, "bgwriter ahead: bgw %u-%u strategy %u-%u delta=%ld lap=%d",
- next_passes, next_to_clean,
- strategy_passes, strategy_buf_id,
- strategy_delta, bufs_to_lap);
+ elog(DEBUG2, "bgwriter behind: bgw %u-%u strategy %u-%u delta=%ld",
+				 next_passes[partition], next_to_clean[partition],
+ strategy_passes, strategy_buf_id,
+ strategy_delta);
#endif
+ next_to_clean[partition] = strategy_buf_id;
+ next_passes[partition] = strategy_passes;
+ bufs_to_lap = num_buffers;
+ }
}
else
{
/*
- * We're behind, so skip forward to the strategy point and start
- * cleaning from there.
+ * Initializing at startup or after LRU scanning had been off. Always
+ * start at the strategy point.
*/
#ifdef BGW_DEBUG
- elog(DEBUG2, "bgwriter behind: bgw %u-%u strategy %u-%u delta=%ld",
- next_passes, next_to_clean,
- strategy_passes, strategy_buf_id,
- strategy_delta);
+ elog(DEBUG2, "bgwriter initializing: strategy %u-%u",
+ strategy_passes, strategy_buf_id);
#endif
- next_to_clean = strategy_buf_id;
- next_passes = strategy_passes;
- bufs_to_lap = NBuffers;
+ strategy_delta = 0;
+ next_to_clean[partition] = strategy_buf_id;
+ next_passes[partition] = strategy_passes;
+ bufs_to_lap = num_buffers;
}
- }
- else
- {
- /*
- * Initializing at startup or after LRU scanning had been off. Always
- * start at the strategy point.
- */
-#ifdef BGW_DEBUG
- elog(DEBUG2, "bgwriter initializing: strategy %u-%u",
- strategy_passes, strategy_buf_id);
-#endif
- strategy_delta = 0;
- next_to_clean = strategy_buf_id;
- next_passes = strategy_passes;
- bufs_to_lap = NBuffers;
- }
- /* Update saved info for next time */
- prev_strategy_buf_id = strategy_buf_id;
- prev_strategy_passes = strategy_passes;
- saved_info_valid = true;
+ /* Update saved info for next time */
+ prev_strategy_buf_id[partition] = strategy_buf_id;
+ prev_strategy_passes[partition] = strategy_passes;
+ // FIXME has to happen after all partitions
+ // saved_info_valid = true;
- /*
- * Compute how many buffers had to be scanned for each new allocation, ie,
- * 1/density of reusable buffers, and track a moving average of that.
- *
- * If the strategy point didn't move, we don't update the density estimate
- */
- if (strategy_delta > 0 && recent_alloc > 0)
- {
- scans_per_alloc = (float) strategy_delta / (float) recent_alloc;
- smoothed_density += (scans_per_alloc - smoothed_density) /
- smoothing_samples;
- }
+ /*
+ * Compute how many buffers had to be scanned for each new allocation, ie,
+ * 1/density of reusable buffers, and track a moving average of that.
+ *
+ * If the strategy point didn't move, we don't update the density estimate
+ */
+ if (strategy_delta > 0 && recent_alloc_partition > 0)
+ {
+ scans_per_alloc = (float) strategy_delta / (float) recent_alloc_partition;
+ smoothed_density += (scans_per_alloc - smoothed_density) /
+ smoothing_samples;
+ }
- /*
- * Estimate how many reusable buffers there are between the current
- * strategy point and where we've scanned ahead to, based on the smoothed
- * density estimate.
- */
- bufs_ahead = NBuffers - bufs_to_lap;
- reusable_buffers_est = (float) bufs_ahead / smoothed_density;
+ /*
+ * Estimate how many reusable buffers there are between the current
+ * strategy point and where we've scanned ahead to, based on the smoothed
+ * density estimate.
+ */
+ bufs_ahead = num_buffers - bufs_to_lap;
+ reusable_buffers_est = (float) bufs_ahead / smoothed_density;
- /*
- * Track a moving average of recent buffer allocations. Here, rather than
- * a true average we want a fast-attack, slow-decline behavior: we
- * immediately follow any increase.
- */
- if (smoothed_alloc <= (float) recent_alloc)
- smoothed_alloc = recent_alloc;
- else
- smoothed_alloc += ((float) recent_alloc - smoothed_alloc) /
- smoothing_samples;
+ /*
+ * Track a moving average of recent buffer allocations. Here, rather than
+ * a true average we want a fast-attack, slow-decline behavior: we
+ * immediately follow any increase.
+ */
+ if (smoothed_alloc <= (float) recent_alloc_partition)
+ smoothed_alloc = recent_alloc_partition;
+ else
+ smoothed_alloc += ((float) recent_alloc_partition - smoothed_alloc) /
+ smoothing_samples;
- /* Scale the estimate by a GUC to allow more aggressive tuning. */
- upcoming_alloc_est = (int) (smoothed_alloc * bgwriter_lru_multiplier);
+ /* Scale the estimate by a GUC to allow more aggressive tuning. */
+ upcoming_alloc_est = (int) (smoothed_alloc * bgwriter_lru_multiplier);
- /*
- * If recent_alloc remains at zero for many cycles, smoothed_alloc will
- * eventually underflow to zero, and the underflows produce annoying
- * kernel warnings on some platforms. Once upcoming_alloc_est has gone to
- * zero, there's no point in tracking smaller and smaller values of
- * smoothed_alloc, so just reset it to exactly zero to avoid this
- * syndrome. It will pop back up as soon as recent_alloc increases.
- */
- if (upcoming_alloc_est == 0)
- smoothed_alloc = 0;
+ /*
+ * If recent_alloc remains at zero for many cycles, smoothed_alloc will
+ * eventually underflow to zero, and the underflows produce annoying
+ * kernel warnings on some platforms. Once upcoming_alloc_est has gone to
+ * zero, there's no point in tracking smaller and smaller values of
+ * smoothed_alloc, so just reset it to exactly zero to avoid this
+ * syndrome. It will pop back up as soon as recent_alloc increases.
+ */
+ if (upcoming_alloc_est == 0)
+ smoothed_alloc = 0;
- /*
- * Even in cases where there's been little or no buffer allocation
- * activity, we want to make a small amount of progress through the buffer
- * cache so that as many reusable buffers as possible are clean after an
- * idle period.
- *
- * (scan_whole_pool_milliseconds / BgWriterDelay) computes how many times
- * the BGW will be called during the scan_whole_pool time; slice the
- * buffer pool into that many sections.
- */
- min_scan_buffers = (int) (NBuffers / (scan_whole_pool_milliseconds / BgWriterDelay));
+ /*
+ * Even in cases where there's been little or no buffer allocation
+ * activity, we want to make a small amount of progress through the buffer
+ * cache so that as many reusable buffers as possible are clean after an
+ * idle period.
+ *
+ * (scan_whole_pool_milliseconds / BgWriterDelay) computes how many times
+ * the BGW will be called during the scan_whole_pool time; slice the
+ * buffer pool into that many sections.
+ */
+ min_scan_buffers = (int) (num_buffers / (scan_whole_pool_milliseconds / BgWriterDelay));
- if (upcoming_alloc_est < (min_scan_buffers + reusable_buffers_est))
- {
+ if (upcoming_alloc_est < (min_scan_buffers + reusable_buffers_est))
+ {
#ifdef BGW_DEBUG
- elog(DEBUG2, "bgwriter: alloc_est=%d too small, using min=%d + reusable_est=%d",
- upcoming_alloc_est, min_scan_buffers, reusable_buffers_est);
+ elog(DEBUG2, "bgwriter: alloc_est=%d too small, using min=%d + reusable_est=%d",
+ upcoming_alloc_est, min_scan_buffers, reusable_buffers_est);
#endif
- upcoming_alloc_est = min_scan_buffers + reusable_buffers_est;
- }
-
- /*
- * Now write out dirty reusable buffers, working forward from the
- * next_to_clean point, until we have lapped the strategy scan, or cleaned
- * enough buffers to match our estimate of the next cycle's allocation
- * requirements, or hit the bgwriter_lru_maxpages limit.
- */
+ upcoming_alloc_est = min_scan_buffers + reusable_buffers_est;
+ }
- num_to_scan = bufs_to_lap;
- num_written = 0;
- reusable_buffers = reusable_buffers_est;
+ /*
+ * Now write out dirty reusable buffers, working forward from the
+ * next_to_clean point, until we have lapped the strategy scan, or cleaned
+ * enough buffers to match our estimate of the next cycle's allocation
+ * requirements, or hit the bgwriter_lru_maxpages limit.
+ */
- /* Execute the LRU scan */
- while (num_to_scan > 0 && reusable_buffers < upcoming_alloc_est)
- {
- int sync_state = SyncOneBuffer(next_to_clean, true,
- wb_context);
+ num_to_scan = bufs_to_lap;
+ num_written = 0;
+ reusable_buffers = reusable_buffers_est;
- if (++next_to_clean >= NBuffers)
+ /* Execute the LRU scan */
+ while (num_to_scan > 0 && reusable_buffers < upcoming_alloc_est)
{
- next_to_clean = 0;
- next_passes++;
- }
- num_to_scan--;
+ int sync_state = SyncOneBuffer(next_to_clean[partition], true,
+ wb_context);
- if (sync_state & BUF_WRITTEN)
- {
- reusable_buffers++;
- if (++num_written >= bgwriter_lru_maxpages)
+ if (++next_to_clean[partition] >= (first_buffer + num_buffers))
{
- PendingBgWriterStats.maxwritten_clean++;
- break;
+ next_to_clean[partition] = first_buffer;
+ next_passes[partition]++;
+ }
+ num_to_scan--;
+
+ if (sync_state & BUF_WRITTEN)
+ {
+ reusable_buffers++;
+ if (++num_written >= (bgwriter_lru_maxpages / num_partitions))
+ {
+ PendingBgWriterStats.maxwritten_clean++;
+ break;
+ }
}
+ else if (sync_state & BUF_REUSABLE)
+ reusable_buffers++;
}
- else if (sync_state & BUF_REUSABLE)
- reusable_buffers++;
- }
- PendingBgWriterStats.buf_written_clean += num_written;
+ PendingBgWriterStats.buf_written_clean += num_written;
#ifdef BGW_DEBUG
- elog(DEBUG1, "bgwriter: recent_alloc=%u smoothed=%.2f delta=%ld ahead=%d density=%.2f reusable_est=%d upcoming_est=%d scanned=%d wrote=%d reusable=%d",
- recent_alloc, smoothed_alloc, strategy_delta, bufs_ahead,
- smoothed_density, reusable_buffers_est, upcoming_alloc_est,
- bufs_to_lap - num_to_scan,
- num_written,
- reusable_buffers - reusable_buffers_est);
+ elog(DEBUG1, "bgwriter: recent_alloc=%u smoothed=%.2f delta=%ld ahead=%d density=%.2f reusable_est=%d upcoming_est=%d scanned=%d wrote=%d reusable=%d",
+ recent_alloc_partition, smoothed_alloc, strategy_delta, bufs_ahead,
+ smoothed_density, reusable_buffers_est, upcoming_alloc_est,
+ bufs_to_lap - num_to_scan,
+ num_written,
+ reusable_buffers - reusable_buffers_est);
#endif
- /*
- * Consider the above scan as being like a new allocation scan.
- * Characterize its density and update the smoothed one based on it. This
- * effectively halves the moving average period in cases where both the
- * strategy and the background writer are doing some useful scanning,
- * which is helpful because a long memory isn't as desirable on the
- * density estimates.
- */
- new_strategy_delta = bufs_to_lap - num_to_scan;
- new_recent_alloc = reusable_buffers - reusable_buffers_est;
- if (new_strategy_delta > 0 && new_recent_alloc > 0)
- {
- scans_per_alloc = (float) new_strategy_delta / (float) new_recent_alloc;
- smoothed_density += (scans_per_alloc - smoothed_density) /
- smoothing_samples;
+ /*
+ * Consider the above scan as being like a new allocation scan.
+ * Characterize its density and update the smoothed one based on it. This
+ * effectively halves the moving average period in cases where both the
+ * strategy and the background writer are doing some useful scanning,
+ * which is helpful because a long memory isn't as desirable on the
+ * density estimates.
+ */
+ new_strategy_delta = bufs_to_lap - num_to_scan;
+ new_recent_alloc = reusable_buffers - reusable_buffers_est;
+ if (new_strategy_delta > 0 && new_recent_alloc > 0)
+ {
+ scans_per_alloc = (float) new_strategy_delta / (float) new_recent_alloc;
+ smoothed_density += (scans_per_alloc - smoothed_density) /
+ smoothing_samples;
#ifdef BGW_DEBUG
- elog(DEBUG2, "bgwriter: cleaner density alloc=%u scan=%ld density=%.2f new smoothed=%.2f",
- new_recent_alloc, new_strategy_delta,
- scans_per_alloc, smoothed_density);
+ elog(DEBUG2, "bgwriter: cleaner density alloc=%u scan=%ld density=%.2f new smoothed=%.2f",
+ new_recent_alloc, new_strategy_delta,
+ scans_per_alloc, smoothed_density);
#endif
+ }
+
+ /* hibernate if all partitions can hibernate */
+ hibernate &= (bufs_to_lap == 0 && recent_alloc_partition == 0);
}
+ /* now that we've scanned all partitions, mark the cached info as valid */
+ saved_info_valid = true;
+
/* Return true if OK to hibernate */
- return (bufs_to_lap == 0 && recent_alloc == 0);
+ return hibernate;
}
/*
diff --git a/src/backend/storage/buffer/freelist.c b/src/backend/storage/buffer/freelist.c
index e38e5c7ec3d..1827e052da7 100644
--- a/src/backend/storage/buffer/freelist.c
+++ b/src/backend/storage/buffer/freelist.c
@@ -63,17 +63,27 @@ typedef struct BufferStrategyFreelist
#define MIN_FREELIST_PARTITIONS 4
/*
- * The shared freelist control information.
+ * Information about one partition of the ClockSweep (on a subset of buffers).
+ *
+ * XXX Should be careful to align this to cachelines, etc.
*/
typedef struct
{
/* Spinlock: protects the values below */
- slock_t buffer_strategy_lock;
+ slock_t clock_sweep_lock;
+
+	/* range for this clock sweep partition */
+ int32 firstBuffer;
+ int32 numBuffers;
/*
* Clock sweep hand: index of next buffer to consider grabbing. Note that
* this isn't a concrete buffer - we only ever increase the value. So, to
* get an actual buffer, it needs to be used modulo NBuffers.
+ *
+ * XXX This is relative to firstBuffer, so needs to be offset properly.
+ *
+ * XXX firstBuffer + (nextVictimBuffer % numBuffers)
*/
pg_atomic_uint32 nextVictimBuffer;
@@ -83,6 +93,15 @@ typedef struct
*/
uint32 completePasses; /* Complete cycles of the clock sweep */
pg_atomic_uint32 numBufferAllocs; /* Buffers allocated since last reset */
+} ClockSweep;
+
+/*
+ * The shared freelist control information.
+ */
+typedef struct
+{
+ /* Spinlock: protects the values below */
+ slock_t buffer_strategy_lock;
/*
* Bgworker process to be notified upon activity or -1 if none. See
@@ -99,6 +118,9 @@ typedef struct
int num_partitions_groups; /* effectively num of NUMA nodes */
int num_partitions_per_group;
+ /* clocksweep partitions */
+ ClockSweep *sweeps;
+
BufferStrategyFreelist freelists[FLEXIBLE_ARRAY_MEMBER];
} BufferStrategyControl;
@@ -152,6 +174,7 @@ static BufferDesc *GetBufferFromRing(BufferAccessStrategy strategy,
uint32 *buf_state);
static void AddBufferToRing(BufferAccessStrategy strategy,
BufferDesc *buf);
+static ClockSweep *ChooseClockSweep(void);
/*
* ClockSweepTick - Helper routine for StrategyGetBuffer()
@@ -163,6 +186,7 @@ static inline uint32
ClockSweepTick(void)
{
uint32 victim;
+ ClockSweep *sweep = ChooseClockSweep();
/*
* Atomically move hand ahead one buffer - if there's several processes
@@ -170,14 +194,14 @@ ClockSweepTick(void)
* apparent order.
*/
victim =
- pg_atomic_fetch_add_u32(&StrategyControl->nextVictimBuffer, 1);
+ pg_atomic_fetch_add_u32(&sweep->nextVictimBuffer, 1);
- if (victim >= NBuffers)
+ if (victim >= sweep->numBuffers)
{
uint32 originalVictim = victim;
/* always wrap what we look up in BufferDescriptors */
- victim = victim % NBuffers;
+ victim = victim % sweep->numBuffers;
/*
* If we're the one that just caused a wraparound, force
@@ -203,19 +227,23 @@ ClockSweepTick(void)
* could lead to an overflow of nextVictimBuffers, but that's
* highly unlikely and wouldn't be particularly harmful.
*/
- SpinLockAcquire(&StrategyControl->buffer_strategy_lock);
+ SpinLockAcquire(&sweep->clock_sweep_lock);
- wrapped = expected % NBuffers;
+ wrapped = expected % sweep->numBuffers;
- success = pg_atomic_compare_exchange_u32(&StrategyControl->nextVictimBuffer,
+ success = pg_atomic_compare_exchange_u32(&sweep->nextVictimBuffer,
&expected, wrapped);
if (success)
- StrategyControl->completePasses++;
- SpinLockRelease(&StrategyControl->buffer_strategy_lock);
+ sweep->completePasses++;
+ SpinLockRelease(&sweep->clock_sweep_lock);
}
}
}
- return victim;
+
+ /* XXX buffer IDs are 1-based, we're calculating 0-based indexes */
+ Assert(BufferIsValid(1 + sweep->firstBuffer + (victim % sweep->numBuffers)));
+
+ return sweep->firstBuffer + victim;
}
/*
@@ -289,6 +317,28 @@ calculate_partition_index()
return index;
}
+/*
+ * ChooseClockSweep
+ * pick a clocksweep partition based on NUMA node and CPU
+ *
+ * The number of clocksweep partitions may not match the number of NUMA
+ * nodes, but it should not be lower. Each partition should be mapped to
+ * a single NUMA node, but a node may have multiple partitions. If there
+ * are multiple partitions per node (all nodes have the same number of
+ * partitions), we pick the partition based on the CPU.
+ *
+ * XXX Maybe we should make both the total and "per group" counts powers of
+ * two? That'd allow using shifts instead of divisions in the calculation,
+ * and that's cheaper. But how would that deal with an odd number of nodes?
+ */
+static ClockSweep *
+ChooseClockSweep(void)
+{
+ int index = calculate_partition_index();
+
+ return &StrategyControl->sweeps[index];
+}
+
/*
* ChooseFreeList
* Pick the buffer freelist to use, depending on the CPU and NUMA node.
@@ -404,7 +454,7 @@ StrategyGetBuffer(BufferAccessStrategy strategy, uint32 *buf_state, bool *from_r
* the rate of buffer consumption. Note that buffers recycled by a
* strategy object are intentionally not counted here.
*/
- pg_atomic_fetch_add_u32(&StrategyControl->numBufferAllocs, 1);
+ pg_atomic_fetch_add_u32(&ChooseClockSweep()->numBufferAllocs, 1);
/*
* First check, without acquiring the lock, whether there's buffers in the
@@ -475,13 +525,17 @@ StrategyGetBuffer(BufferAccessStrategy strategy, uint32 *buf_state, bool *from_r
/*
* Nothing on the freelist, so run the "clock sweep" algorithm
*
- * XXX Should we also make this NUMA-aware, to only access buffers from
- * the same NUMA node? That'd probably mean we need to make the clock
- * sweep NUMA-aware, perhaps by having multiple clock sweeps, each for a
- * subset of buffers. But that also means each process could "sweep" only
- * a fraction of buffers, even if the other buffers are better candidates
- * for eviction. Would that also mean we'd have multiple bgwriters, one
- * for each node, or would one bgwriter handle all of that?
+ * XXX Note that ClockSweepTick() is NUMA-aware, i.e. it only looks at
+ * buffers from a single partition, aligned with the NUMA node. That
+ * means it only accesses buffers from the same NUMA node.
+ *
+ * XXX That also means each process "sweeps" only a fraction of buffers,
+ * even if the other buffers are better candidates for eviction. Maybe
+ * there should be some logic to "steal" buffers from other freelists
+ * or other nodes?
+ *
+ * XXX Would that also mean we'd have multiple bgwriters, one for each
+ * node, or would one bgwriter handle all of that?
*/
trycounter = NBuffers;
for (;;)
@@ -563,6 +617,41 @@ StrategyFreeBuffer(BufferDesc *buf)
SpinLockRelease(&freelist->freelist_lock);
}
+/*
+ * StrategySyncPrepare -- prepare for sync of all partitions
+ *
+ * Determine the number of clocksweep partitions, and sum up the recent
+ * buffer alloc counts from all of them. This allows BgBufferSync to
+ * calculate the average number of allocations per partition for the next
+ * sync cycle.
+ *
+ * The returned alloc count is the total summed from all partitions; the
+ * per-partition counters are reset after being read, as the partitions
+ * are walked.
+ */
+void
+StrategySyncPrepare(int *num_parts, uint32 *num_buf_alloc)
+{
+ *num_buf_alloc = 0;
+ *num_parts = StrategyControl->num_partitions;
+
+ /*
+	 * We lock the partitions one by one, so not exactly in sync, but that
+ * should be fine. We're only looking for heuristics anyway.
+ */
+ for (int i = 0; i < StrategyControl->num_partitions; i++)
+ {
+ ClockSweep *sweep = &StrategyControl->sweeps[i];
+
+ SpinLockAcquire(&sweep->clock_sweep_lock);
+ if (num_buf_alloc)
+ {
+ *num_buf_alloc += pg_atomic_exchange_u32(&sweep->numBufferAllocs, 0);
+ }
+ SpinLockRelease(&sweep->clock_sweep_lock);
+ }
+}
+
/*
* StrategySyncStart -- tell BgBufferSync where to start syncing
*
@@ -570,37 +659,44 @@ StrategyFreeBuffer(BufferDesc *buf)
* BgBufferSync() will proceed circularly around the buffer array from there.
*
* In addition, we return the completed-pass count (which is effectively
- * the higher-order bits of nextVictimBuffer) and the count of recent buffer
- * allocs if non-NULL pointers are passed. The alloc count is reset after
- * being read.
+ * the higher-order bits of nextVictimBuffer).
+ *
+ * This only considers a single clocksweep partition, as BgBufferSync looks
+ * at them one by one.
*/
int
-StrategySyncStart(uint32 *complete_passes, uint32 *num_buf_alloc)
+StrategySyncStart(int partition, uint32 *complete_passes,
+ int *first_buffer, int *num_buffers)
{
uint32 nextVictimBuffer;
int result;
+ ClockSweep *sweep = &StrategyControl->sweeps[partition];
- SpinLockAcquire(&StrategyControl->buffer_strategy_lock);
- nextVictimBuffer = pg_atomic_read_u32(&StrategyControl->nextVictimBuffer);
- result = nextVictimBuffer % NBuffers;
+ Assert((partition >= 0) && (partition < StrategyControl->num_partitions));
+
+ SpinLockAcquire(&sweep->clock_sweep_lock);
+ nextVictimBuffer = pg_atomic_read_u32(&sweep->nextVictimBuffer);
+ result = nextVictimBuffer % sweep->numBuffers;
+
+ *first_buffer = sweep->firstBuffer;
+ *num_buffers = sweep->numBuffers;
if (complete_passes)
{
- *complete_passes = StrategyControl->completePasses;
+ *complete_passes = sweep->completePasses;
/*
* Additionally add the number of wraparounds that happened before
* completePasses could be incremented. C.f. ClockSweepTick().
*/
- *complete_passes += nextVictimBuffer / NBuffers;
+ *complete_passes += nextVictimBuffer / sweep->numBuffers;
}
+ SpinLockRelease(&sweep->clock_sweep_lock);
- if (num_buf_alloc)
- {
- *num_buf_alloc = pg_atomic_exchange_u32(&StrategyControl->numBufferAllocs, 0);
- }
- SpinLockRelease(&StrategyControl->buffer_strategy_lock);
- return result;
+ /* XXX buffer IDs start at 1, we're calculating 0-based indexes */
+ Assert(BufferIsValid(1 + sweep->firstBuffer + result));
+
+ return sweep->firstBuffer + result;
}
/*
@@ -696,6 +792,10 @@ StrategyShmemSize(void)
size = add_size(size, MAXALIGN(mul_size(sizeof(BufferStrategyFreelist),
num_partitions)));
+ /* size of clocksweep partitions (at least one per NUMA node) */
+ size = add_size(size, MAXALIGN(mul_size(sizeof(ClockSweep),
+ num_partitions)));
+
return size;
}
@@ -714,6 +814,7 @@ StrategyInitialize(bool init)
int num_partitions;
int num_partitions_per_group;
+ char *ptr;
/* */
num_partitions = calculate_partition_count(strategy_nnodes);
@@ -736,7 +837,8 @@ StrategyInitialize(bool init)
StrategyControl = (BufferStrategyControl *)
ShmemInitStruct("Buffer Strategy Status",
MAXALIGN(offsetof(BufferStrategyControl, freelists)) +
- MAXALIGN(sizeof(BufferStrategyFreelist) * num_partitions),
+ MAXALIGN(sizeof(BufferStrategyFreelist) * num_partitions) +
+ MAXALIGN(sizeof(ClockSweep) * num_partitions),
&found);
if (!found)
@@ -758,12 +860,32 @@ StrategyInitialize(bool init)
SpinLockInit(&StrategyControl->buffer_strategy_lock);
- /* Initialize the clock sweep pointer */
- pg_atomic_init_u32(&StrategyControl->nextVictimBuffer, 0);
+ /* have to point the sweeps array to right after the freelists */
+ ptr = (char *) StrategyControl +
+ MAXALIGN(offsetof(BufferStrategyControl, freelists)) +
+ MAXALIGN(sizeof(BufferStrategyFreelist) * num_partitions);
+ StrategyControl->sweeps = (ClockSweep *) ptr;
- /* Clear statistics */
- StrategyControl->completePasses = 0;
- pg_atomic_init_u32(&StrategyControl->numBufferAllocs, 0);
+ /* Initialize the clock sweep pointers (for all partitions) */
+ for (int i = 0; i < num_partitions; i++)
+ {
+ SpinLockInit(&StrategyControl->sweeps[i].clock_sweep_lock);
+
+ pg_atomic_init_u32(&StrategyControl->sweeps[i].nextVictimBuffer, 0);
+
+ /*
+			 * FIXME This may not be quite right, because if NBuffers is not
+			 * a perfect multiple of num_partitions, the last partition will have
+ * numBuffers set too high. buf_init handles this by tracking the
+ * remaining number of buffers, and not overflowing.
+ */
+ StrategyControl->sweeps[i].numBuffers = numBuffers;
+ StrategyControl->sweeps[i].firstBuffer = (numBuffers * i);
+
+ /* Clear statistics */
+ StrategyControl->sweeps[i].completePasses = 0;
+ pg_atomic_init_u32(&StrategyControl->sweeps[i].numBufferAllocs, 0);
+ }
/* No pending notification */
StrategyControl->bgwprocno = -1;
diff --git a/src/include/storage/buf_internals.h b/src/include/storage/buf_internals.h
index 52a71b138f7..b50f9458156 100644
--- a/src/include/storage/buf_internals.h
+++ b/src/include/storage/buf_internals.h
@@ -448,7 +448,9 @@ extern void StrategyFreeBuffer(BufferDesc *buf);
extern bool StrategyRejectBuffer(BufferAccessStrategy strategy,
BufferDesc *buf, bool from_ring);
-extern int StrategySyncStart(uint32 *complete_passes, uint32 *num_buf_alloc);
+extern void StrategySyncPrepare(int *num_parts, uint32 *num_buf_alloc);
+extern int StrategySyncStart(int partition, uint32 *complete_passes,
+ int *first_buffer, int *num_buffers);
extern void StrategyNotifyBgWriter(int bgwprocno);
extern Size StrategyShmemSize(void);
--
2.49.0
Attachment: v2-0004-NUMA-partition-buffer-freelist.patch
From d67278a64983b5f2eb5e408a51e9516aa8fd2264 Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tomas@vondra.me>
Date: Thu, 22 May 2025 18:38:41 +0200
Subject: [PATCH v2 4/7] NUMA: partition buffer freelist
Instead of a single buffer freelist, partition into multiple smaller
lists, to reduce lock contention, and to spread the buffers over all
NUMA nodes more evenly.
There are four strategies, specified by GUC numa_partition_freelist
* none - single long freelist, should work just like now
* node - one freelist per NUMA node, with only buffers from that node
* cpu - one freelist per CPU
* pid - freelist determined by PID (same number of freelists as 'cpu')
When allocating a buffer, it's taken from the correct freelist (e.g.
same NUMA node).
Note: This is (probably) more important than partitioning ProcArray.
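To make the mode selection concrete, here is a minimal standalone sketch (not
code from the patch) of how a backend might map itself to one of the freelists
under each mode. The helper choose_freelist_index() and the num_freelists
argument are made up for illustration; the enum values mirror the
FreelistPartitionMode added to bufmgr.h, and getcpu() is the same call the
patch uses in calculate_partition_index().

#define _GNU_SOURCE
#include <sched.h>
#include <unistd.h>

typedef enum FreelistPartitionMode
{
	FREELIST_PARTITION_NONE,
	FREELIST_PARTITION_NODE,
	FREELIST_PARTITION_CPU,
	FREELIST_PARTITION_PID,
} FreelistPartitionMode;

static int
choose_freelist_index(FreelistPartitionMode mode, int num_freelists)
{
	unsigned int cpu,
				node;

	if (mode == FREELIST_PARTITION_NONE || num_freelists <= 1)
		return 0;				/* single freelist, same as today */

	if (mode == FREELIST_PARTITION_PID)
		return getpid() % num_freelists;	/* no syscall per allocation */

	if (getcpu(&cpu, &node) != 0)
		return 0;				/* be defensive, fall back to list 0 */

	if (mode == FREELIST_PARTITION_NODE)
		return node % num_freelists;	/* one list per NUMA node */

	return cpu % num_freelists;	/* FREELIST_PARTITION_CPU */
}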
---
src/backend/storage/buffer/buf_init.c | 4 +-
src/backend/storage/buffer/freelist.c | 372 ++++++++++++++++++++++++--
src/backend/utils/init/globals.c | 1 +
src/backend/utils/misc/guc_tables.c | 10 +
src/include/miscadmin.h | 1 +
src/include/storage/bufmgr.h | 8 +
6 files changed, 367 insertions(+), 29 deletions(-)
diff --git a/src/backend/storage/buffer/buf_init.c b/src/backend/storage/buffer/buf_init.c
index 2ad34624c49..920f1a32a8f 100644
--- a/src/backend/storage/buffer/buf_init.c
+++ b/src/backend/storage/buffer/buf_init.c
@@ -543,8 +543,8 @@ pg_numa_interleave_memory(char *startptr, char *endptr,
* XXX no return value, to make this fail on error, has to use
* numa_set_strict
*
- * XXX Should we still touch the memory first, like with numa_move_pages,
- * or is that not necessary?
+ * XXX Should we still touch the memory first, like with
+ * numa_move_pages, or is that not necessary?
*/
numa_tonode_memory(ptr, sz, node);
diff --git a/src/backend/storage/buffer/freelist.c b/src/backend/storage/buffer/freelist.c
index e046526c149..e38e5c7ec3d 100644
--- a/src/backend/storage/buffer/freelist.c
+++ b/src/backend/storage/buffer/freelist.c
@@ -15,14 +15,52 @@
*/
#include "postgres.h"
+#include <sched.h>
+#include <sys/sysinfo.h>
+
+#ifdef USE_LIBNUMA
+#include <numa.h>
+#include <numaif.h>
+#endif
+
#include "pgstat.h"
#include "port/atomics.h"
#include "storage/buf_internals.h"
#include "storage/bufmgr.h"
+#include "storage/ipc.h"
#include "storage/proc.h"
#define INT_ACCESS_ONCE(var) ((int)(*((volatile int *)&(var))))
+/*
+ * Represents one freelist partition.
+ */
+typedef struct BufferStrategyFreelist
+{
+ /* Spinlock: protects the values below */
+ slock_t freelist_lock;
+
+ /*
+ * XXX Not sure why this needs to be aligned like this. Need to ask
+ * Andres.
+ */
+ int firstFreeBuffer __attribute__((aligned(64))); /* Head of list of
+ * unused buffers */
+
+ /* Number of buffers consumed from this list. */
+ uint64 consumed;
+} BufferStrategyFreelist;
+
+/*
+ * The minimum number of partitions we want to have. We want at least this
+ * number of partitions, even on a non-NUMA system, as it helps with contention
+ * for buffers. But with multiple NUMA nodes, we want a separate partition per
+ * node, and we may get multiple partitions per node when the node count is low.
+ *
+ * With multiple partitions per NUMA node, we pick the partition based on CPU
+ * (or some other parameter).
+ */
+#define MIN_FREELIST_PARTITIONS 4
/*
* The shared freelist control information.
@@ -39,8 +77,6 @@ typedef struct
*/
pg_atomic_uint32 nextVictimBuffer;
- int firstFreeBuffer; /* Head of list of unused buffers */
-
/*
* Statistics. These counters should be wide enough that they can't
* overflow during a single bgwriter cycle.
@@ -51,13 +87,38 @@ typedef struct
/*
* Bgworker process to be notified upon activity or -1 if none. See
* StrategyNotifyBgWriter.
+ *
+ * XXX Not sure why this needs to be aligned like this. Need to ask
+ * Andres. Also, shouldn't the alignment be specified after, like for
+ * "consumed"?
*/
- int bgwprocno;
+ int __attribute__((aligned(64))) bgwprocno;
+
+ /* info about freelist partitioning */
+ int num_partitions;
+ int num_partitions_groups; /* effectively num of NUMA nodes */
+ int num_partitions_per_group;
+
+ BufferStrategyFreelist freelists[FLEXIBLE_ARRAY_MEMBER];
} BufferStrategyControl;
/* Pointers to shared state */
static BufferStrategyControl *StrategyControl = NULL;
+/*
+ * XXX shouldn't this be in BufferStrategyControl? Probably not, we need to
+ * calculate it during sizing, and perhaps it could change before the memory
+ * gets allocated (so we need to remember the values).
+ *
+ * XXX We should probably have a fixed number of partitions, and map the
+ * NUMA nodes to them, somehow (i.e. each node would get some subset of
+ * partitions). Similar to NUM_LOCK_PARTITIONS.
+ *
+ * XXX We don't use the ncpus, really.
+ */
+static int strategy_ncpus;
+static int strategy_nnodes;
+
/*
* Private (non-shared) state for managing a ring of shared buffers to re-use.
* This is currently the only kind of BufferAccessStrategy object, but someday
@@ -157,6 +218,104 @@ ClockSweepTick(void)
return victim;
}
+/*
+ * Size the clocksweep partitions. At least one partition per NUMA node,
+ * but at least MIN_FREELIST_PARTITIONS partitions in total.
+*/
+static int
+calculate_partition_count(int num_nodes)
+{
+ int num_per_node = 1;
+
+ while (num_per_node * num_nodes < MIN_FREELIST_PARTITIONS)
+ num_per_node++;
+
+ return (num_nodes * num_per_node);
+}
+
+static int
+calculate_partition_index()
+{
+ int rc;
+ unsigned cpu;
+ unsigned node;
+ int index;
+
+ Assert(StrategyControl->num_partitions_groups == strategy_nnodes);
+
+ Assert(StrategyControl->num_partitions ==
+ (strategy_nnodes * StrategyControl->num_partitions_per_group));
+
+ /*
+ * freelist is partitioned, so determine the CPU/NUMA node, and pick a
+ * list based on that.
+ */
+ rc = getcpu(&cpu, &node);
+ if (rc != 0)
+ elog(ERROR, "getcpu failed: %m");
+
+ /*
+	 * XXX We shouldn't get nodes that we haven't considered while building
+ * the partitions. Maybe if we allow this (e.g. due to support adjusting
+ * the NUMA stuff at runtime), we should just do our best to minimize
+ * the conflicts somehow. But it'll make the mapping harder, so for now
+ * we ignore it.
+ */
+ if (node > strategy_nnodes)
+		elog(ERROR, "node out of range: %u > %d", node, strategy_nnodes);
+
+ /*
+ * Find the partition. If we have a single partition per node, we can
+ * calculate the index directly from node. Otherwise we need to do two
+ * steps, using node and then cpu.
+ */
+ if (StrategyControl->num_partitions_per_group == 1)
+ {
+ index = (node % StrategyControl->num_partitions);
+ }
+ else
+ {
+ int index_group,
+ index_part;
+
+ /* two steps - calculate group from node, partition from cpu */
+ index_group = (node % StrategyControl->num_partitions_groups);
+ index_part = (cpu % StrategyControl->num_partitions_per_group);
+
+ index = (index_group * StrategyControl->num_partitions_per_group)
+ + index_part;
+ }
+
+ return index;
+}
+
+/*
+ * ChooseFreeList
+ * Pick the buffer freelist to use, depending on the CPU and NUMA node.
+ *
+ * Without partitioned freelists (numa_partition_freelist=false), there's only
+ * a single freelist, so use that.
+ *
+ * With partitioned freelists, we have multiple ways how to pick the freelist
+ * for the backend:
+ *
+ * - one freelist per CPU, use the freelist for CPU the task executes on
+ *
+ * - one freelist per NUMA node, use the freelist for node task executes on
+ *
+ * - use fixed number of freelists, map processes to lists based on PID
+ *
+ * There may be some other strategies, not sure. The important thing is this
+ * needs to be reflected during initialization, i.e. we need to create the
+ * right number of lists.
+ */
+static BufferStrategyFreelist *
+ChooseFreeList(void)
+{
+ int index = calculate_partition_index();
+ return &StrategyControl->freelists[index];
+}
+
/*
* have_free_buffer -- a lockless check to see if there is a free buffer in
* buffer pool.
@@ -168,10 +327,13 @@ ClockSweepTick(void)
bool
have_free_buffer(void)
{
- if (StrategyControl->firstFreeBuffer >= 0)
- return true;
- else
- return false;
+	for (int i = 0; i < StrategyControl->num_partitions; i++)
+ {
+ if (StrategyControl->freelists[i].firstFreeBuffer >= 0)
+ return true;
+ }
+
+ return false;
}
/*
@@ -193,6 +355,7 @@ StrategyGetBuffer(BufferAccessStrategy strategy, uint32 *buf_state, bool *from_r
int bgwprocno;
int trycounter;
uint32 local_buf_state; /* to avoid repeated (de-)referencing */
+ BufferStrategyFreelist *freelist;
*from_ring = false;
@@ -259,31 +422,35 @@ StrategyGetBuffer(BufferAccessStrategy strategy, uint32 *buf_state, bool *from_r
* buffer_strategy_lock not the individual buffer spinlocks, so it's OK to
* manipulate them without holding the spinlock.
*/
- if (StrategyControl->firstFreeBuffer >= 0)
+ freelist = ChooseFreeList();
+ if (freelist->firstFreeBuffer >= 0)
{
while (true)
{
/* Acquire the spinlock to remove element from the freelist */
- SpinLockAcquire(&StrategyControl->buffer_strategy_lock);
+ SpinLockAcquire(&freelist->freelist_lock);
- if (StrategyControl->firstFreeBuffer < 0)
+ if (freelist->firstFreeBuffer < 0)
{
- SpinLockRelease(&StrategyControl->buffer_strategy_lock);
+ SpinLockRelease(&freelist->freelist_lock);
break;
}
- buf = GetBufferDescriptor(StrategyControl->firstFreeBuffer);
+ buf = GetBufferDescriptor(freelist->firstFreeBuffer);
Assert(buf->freeNext != FREENEXT_NOT_IN_LIST);
/* Unconditionally remove buffer from freelist */
- StrategyControl->firstFreeBuffer = buf->freeNext;
+ freelist->firstFreeBuffer = buf->freeNext;
buf->freeNext = FREENEXT_NOT_IN_LIST;
+ /* increment number of buffers we consumed from this list */
+ freelist->consumed++;
+
/*
* Release the lock so someone else can access the freelist while
* we check out this buffer.
*/
- SpinLockRelease(&StrategyControl->buffer_strategy_lock);
+ SpinLockRelease(&freelist->freelist_lock);
/*
* If the buffer is pinned or has a nonzero usage_count, we cannot
@@ -305,7 +472,17 @@ StrategyGetBuffer(BufferAccessStrategy strategy, uint32 *buf_state, bool *from_r
}
}
- /* Nothing on the freelist, so run the "clock sweep" algorithm */
+ /*
+ * Nothing on the freelist, so run the "clock sweep" algorithm
+ *
+ * XXX Should we also make this NUMA-aware, to only access buffers from
+ * the same NUMA node? That'd probably mean we need to make the clock
+ * sweep NUMA-aware, perhaps by having multiple clock sweeps, each for a
+ * subset of buffers. But that also means each process could "sweep" only
+ * a fraction of buffers, even if the other buffers are better candidates
+ * for eviction. Would that also mean we'd have multiple bgwriters, one
+ * for each node, or would one bgwriter handle all of that?
+ */
trycounter = NBuffers;
for (;;)
{
@@ -356,7 +533,22 @@ StrategyGetBuffer(BufferAccessStrategy strategy, uint32 *buf_state, bool *from_r
void
StrategyFreeBuffer(BufferDesc *buf)
{
- SpinLockAcquire(&StrategyControl->buffer_strategy_lock);
+ BufferStrategyFreelist *freelist;
+
+ /*
+ * We don't want to call ChooseFreeList() again, because we might get a
+ * completely different freelist - either a different partition in the
+ * same group, or even a different group if the NUMA node changed. But
+ * we can calculate the proper freelist from the buffer id.
+ */
+ int index = (BufferGetNode(buf->buf_id) * StrategyControl->num_partitions_per_group)
+ + (buf->buf_id % StrategyControl->num_partitions_per_group);
+
+ Assert((index >= 0) && (index < StrategyControl->num_partitions));
+
+ freelist = &StrategyControl->freelists[index];
+
+ SpinLockAcquire(&freelist->freelist_lock);
/*
* It is possible that we are told to put something in the freelist that
@@ -364,11 +556,11 @@ StrategyFreeBuffer(BufferDesc *buf)
*/
if (buf->freeNext == FREENEXT_NOT_IN_LIST)
{
- buf->freeNext = StrategyControl->firstFreeBuffer;
- StrategyControl->firstFreeBuffer = buf->buf_id;
+ buf->freeNext = freelist->firstFreeBuffer;
+ freelist->firstFreeBuffer = buf->buf_id;
}
- SpinLockRelease(&StrategyControl->buffer_strategy_lock);
+ SpinLockRelease(&freelist->freelist_lock);
}
/*
@@ -432,6 +624,42 @@ StrategyNotifyBgWriter(int bgwprocno)
SpinLockRelease(&StrategyControl->buffer_strategy_lock);
}
+/* prints some debug info / stats about freelists at shutdown */
+static void
+freelist_before_shmem_exit(int code, Datum arg)
+{
+	for (int node = 0; node < StrategyControl->num_partitions; node++)
+ {
+ BufferStrategyFreelist *freelist = &StrategyControl->freelists[node];
+ uint64 remain = 0;
+ uint64 actually_free = 0;
+ int cur = freelist->firstFreeBuffer;
+
+ while (cur >= 0)
+ {
+ uint32 local_buf_state;
+ BufferDesc *buf;
+
+ buf = GetBufferDescriptor(cur);
+
+ remain++;
+
+ local_buf_state = LockBufHdr(buf);
+
+ if (!(local_buf_state & BM_TAG_VALID))
+ actually_free++;
+
+ UnlockBufHdr(buf, local_buf_state);
+
+ cur = buf->freeNext;
+ }
+ elog(LOG, "freelist %d, firstF: %d: consumed: %lu, remain: %lu, actually free: %lu",
+ node,
+ freelist->firstFreeBuffer,
+ freelist->consumed,
+ remain, actually_free);
+ }
+}
/*
* StrategyShmemSize
@@ -445,12 +673,28 @@ Size
StrategyShmemSize(void)
{
Size size = 0;
+ int num_partitions;
+
+ /* FIXME */
+#ifdef USE_LIBNUMA
+ strategy_ncpus = numa_num_task_cpus();
+ strategy_nnodes = numa_num_task_nodes();
+#else
+ strategy_ncpus = 1;
+ strategy_nnodes = 1;
+#endif
+
+ num_partitions = calculate_partition_count(strategy_nnodes);
/* size of lookup hash table ... see comment in StrategyInitialize */
size = add_size(size, BufTableShmemSize(NBuffers + NUM_BUFFER_PARTITIONS));
/* size of the shared replacement strategy control block */
- size = add_size(size, MAXALIGN(sizeof(BufferStrategyControl)));
+ size = add_size(size, MAXALIGN(offsetof(BufferStrategyControl, freelists)));
+
+ /* size of freelist partitions (at least one per NUMA node) */
+ size = add_size(size, MAXALIGN(mul_size(sizeof(BufferStrategyFreelist),
+ num_partitions)));
return size;
}
@@ -466,6 +710,13 @@ void
StrategyInitialize(bool init)
{
bool found;
+ int buffers_per_partition;
+
+ int num_partitions;
+ int num_partitions_per_group;
+
+ /* */
+ num_partitions = calculate_partition_count(strategy_nnodes);
/*
* Initialize the shared buffer lookup hashtable.
@@ -484,23 +735,28 @@ StrategyInitialize(bool init)
*/
StrategyControl = (BufferStrategyControl *)
ShmemInitStruct("Buffer Strategy Status",
- sizeof(BufferStrategyControl),
+ MAXALIGN(offsetof(BufferStrategyControl, freelists)) +
+ MAXALIGN(sizeof(BufferStrategyFreelist) * num_partitions),
&found);
if (!found)
{
+ int32 numBuffers = NBuffers / num_partitions;
+
+ while (numBuffers * num_partitions < NBuffers)
+ numBuffers++;
+
+		Assert(numBuffers * num_partitions >= NBuffers);
+
/*
* Only done once, usually in postmaster
*/
Assert(init);
- SpinLockInit(&StrategyControl->buffer_strategy_lock);
+ /* register callback to dump some stats on exit */
+ before_shmem_exit(freelist_before_shmem_exit, 0);
- /*
- * Grab the whole linked list of free buffers for our strategy. We
- * assume it was previously set up by BufferManagerShmemInit().
- */
- StrategyControl->firstFreeBuffer = 0;
+ SpinLockInit(&StrategyControl->buffer_strategy_lock);
/* Initialize the clock sweep pointer */
pg_atomic_init_u32(&StrategyControl->nextVictimBuffer, 0);
@@ -511,6 +767,68 @@ StrategyInitialize(bool init)
/* No pending notification */
StrategyControl->bgwprocno = -1;
+
+ /* always a multiple of NUMA nodes */
+ Assert(num_partitions % strategy_nnodes == 0);
+
+ num_partitions_per_group = (num_partitions / strategy_nnodes);
+
+ /* initialize the partitioned clocksweep */
+ StrategyControl->num_partitions = num_partitions;
+ StrategyControl->num_partitions_groups = strategy_nnodes;
+ StrategyControl->num_partitions_per_group = num_partitions_per_group;
+
+ /*
+ * Rebuild the freelist - right now all buffers are in one huge list,
+ * we want to rework that into multiple lists. Start by initializing
+ * the strategy to have empty lists.
+ */
+ for (int nfreelist = 0; nfreelist < num_partitions; nfreelist++)
+ {
+ BufferStrategyFreelist *freelist;
+
+ freelist = &StrategyControl->freelists[nfreelist];
+
+ freelist->firstFreeBuffer = FREENEXT_END_OF_LIST;
+
+ SpinLockInit(&freelist->freelist_lock);
+ }
+
+ /* buffers per partition */
+ buffers_per_partition = (NBuffers / num_partitions);
+
+ elog(LOG, "NBuffers: %d, nodes %d, ncpus: %d, divide: %d, remain: %d",
+ NBuffers, strategy_nnodes, strategy_ncpus,
+ buffers_per_partition, NBuffers - (num_partitions * buffers_per_partition));
+
+ /*
+ * Walk through the buffers, add them to the correct list. Walk from
+ * the end, because we're adding the buffers to the beginning.
+ */
+ for (int i = NBuffers - 1; i >= 0; i--)
+ {
+ BufferDesc *buf = GetBufferDescriptor(i);
+ BufferStrategyFreelist *freelist;
+ int node;
+ int index;
+
+ /*
+ * Split the freelist into partitions, if needed (or just keep the
+			 * freelist we already built in BufferManagerShmemInit()).
+ */
+
+ /* determine NUMA node for buffer, this determines the group */
+ node = BufferGetNode(i);
+
+ /* now calculate the actual freelist index */
+ index = node * num_partitions_per_group + (i % num_partitions_per_group);
+
+ /* add to the right freelist */
+ freelist = &StrategyControl->freelists[index];
+
+ buf->freeNext = freelist->firstFreeBuffer;
+ freelist->firstFreeBuffer = i;
+ }
}
else
Assert(!init);
diff --git a/src/backend/utils/init/globals.c b/src/backend/utils/init/globals.c
index f5359db3656..a11bc71a386 100644
--- a/src/backend/utils/init/globals.c
+++ b/src/backend/utils/init/globals.c
@@ -148,6 +148,7 @@ int MaxBackends = 0;
/* NUMA stuff */
bool numa_buffers_interleave = false;
bool numa_localalloc = false;
+bool numa_partition_freelist = false;
/* GUC parameters for vacuum */
int VacuumBufferUsageLimit = 2048;
diff --git a/src/backend/utils/misc/guc_tables.c b/src/backend/utils/misc/guc_tables.c
index a21f20800fb..0552ed62cc7 100644
--- a/src/backend/utils/misc/guc_tables.c
+++ b/src/backend/utils/misc/guc_tables.c
@@ -2136,6 +2136,16 @@ struct config_bool ConfigureNamesBool[] =
NULL, NULL, NULL
},
+ {
+ {"numa_partition_freelist", PGC_POSTMASTER, DEVELOPER_OPTIONS,
+ gettext_noop("Enables buffer freelists to be partitioned per NUMA node."),
+ gettext_noop("When enabled, we create a separate freelist per NUMA node."),
+ },
+ &numa_partition_freelist,
+ false,
+ NULL, NULL, NULL
+ },
+
{
{"sync_replication_slots", PGC_SIGHUP, REPLICATION_STANDBY,
gettext_noop("Enables a physical standby to synchronize logical failover replication slots from the primary server."),
diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h
index 692871a401f..66baf2bf33e 100644
--- a/src/include/miscadmin.h
+++ b/src/include/miscadmin.h
@@ -180,6 +180,7 @@ extern PGDLLIMPORT int max_parallel_workers;
extern PGDLLIMPORT bool numa_buffers_interleave;
extern PGDLLIMPORT bool numa_localalloc;
+extern PGDLLIMPORT bool numa_partition_freelist;
extern PGDLLIMPORT int commit_timestamp_buffers;
extern PGDLLIMPORT int multixact_member_buffers;
diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index c257c8a1c20..efb7e28c10f 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -93,6 +93,14 @@ typedef enum ExtendBufferedFlags
EB_LOCK_TARGET = (1 << 5),
} ExtendBufferedFlags;
+typedef enum FreelistPartitionMode
+{
+ FREELIST_PARTITION_NONE,
+ FREELIST_PARTITION_NODE,
+ FREELIST_PARTITION_CPU,
+ FREELIST_PARTITION_PID,
+} FreelistPartitionMode;
+
/*
* Some functions identify relations either by relation or smgr +
* relpersistence. Used via the BMR_REL()/BMR_SMGR() macros below. This
--
2.49.0
Attachment: v2-0003-freelist-Don-t-track-tail-of-a-freelist.patch
From 2faefc2d10dcd9e31e96be5565e82d1904bd7280 Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Mon, 14 Oct 2024 14:10:13 -0400
Subject: [PATCH v2 3/7] freelist: Don't track tail of a freelist
The freelist tail isn't currently used, making it unnecessary overhead.
So just don't do that.
---
src/backend/storage/buffer/freelist.c | 9 ---------
1 file changed, 9 deletions(-)
diff --git a/src/backend/storage/buffer/freelist.c b/src/backend/storage/buffer/freelist.c
index 01909be0272..e046526c149 100644
--- a/src/backend/storage/buffer/freelist.c
+++ b/src/backend/storage/buffer/freelist.c
@@ -40,12 +40,6 @@ typedef struct
pg_atomic_uint32 nextVictimBuffer;
int firstFreeBuffer; /* Head of list of unused buffers */
- int lastFreeBuffer; /* Tail of list of unused buffers */
-
- /*
- * NOTE: lastFreeBuffer is undefined when firstFreeBuffer is -1 (that is,
- * when the list is empty)
- */
/*
* Statistics. These counters should be wide enough that they can't
@@ -371,8 +365,6 @@ StrategyFreeBuffer(BufferDesc *buf)
if (buf->freeNext == FREENEXT_NOT_IN_LIST)
{
buf->freeNext = StrategyControl->firstFreeBuffer;
- if (buf->freeNext < 0)
- StrategyControl->lastFreeBuffer = buf->buf_id;
StrategyControl->firstFreeBuffer = buf->buf_id;
}
@@ -509,7 +501,6 @@ StrategyInitialize(bool init)
* assume it was previously set up by BufferManagerShmemInit().
*/
StrategyControl->firstFreeBuffer = 0;
- StrategyControl->lastFreeBuffer = NBuffers - 1;
/* Initialize the clock sweep pointer */
pg_atomic_init_u32(&StrategyControl->nextVictimBuffer, 0);
--
2.49.0
Attachment: v2-0002-NUMA-localalloc.patch
From c0acd3385fa961e56eb435b85bb021e7ce9e2cb8 Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tomas@vondra.me>
Date: Thu, 22 May 2025 18:27:06 +0200
Subject: [PATCH v2 2/7] NUMA: localalloc
Set the default allocation policy to "localalloc", which means from the
local NUMA node. This is useful for process-private memory, which is not
going to be shared with other nodes, and is relatively short-lived (so
we're unlikely to have issues if the process gets moved by the scheduler).
This sets the default for the whole process, for all future allocations. But
that's fine, we've already populated the shared memory earlier (by
interleaving it explicitly). Otherwise we'd trigger a page fault and the
page would be allocated on the local node.
XXX This patch may not be necessary, as we now place memory on nodes
using explicit numa_tonode_memory() calls, and not by interleaving. But
it's useful for experiments during development, so I'm keeping it.
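For completeness, here is a tiny standalone illustration (not PostgreSQL code,
assumes libnuma) of what the switch changes: the policy only affects pages
faulted in after the call, so memory that was already placed explicitly (e.g.
via numa_tonode_memory) keeps its node. demo_localalloc() is a made-up name.

#include <numa.h>
#include <stdlib.h>
#include <string.h>

static void
demo_localalloc(void)
{
	char	   *priv;

	if (numa_available() < 0)
		return;					/* no NUMA support on this system */

	numa_set_localalloc();		/* future faults satisfied from the local node */

	priv = malloc(1024 * 1024);	/* backend-private memory ... */
	if (priv == NULL)
		return;
	memset(priv, 0, 1024 * 1024);	/* ... faulted in on the node we run on */
	free(priv);
}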
---
src/backend/utils/init/globals.c | 1 +
src/backend/utils/init/miscinit.c | 16 ++++++++++++++++
src/backend/utils/misc/guc_tables.c | 10 ++++++++++
src/include/miscadmin.h | 1 +
4 files changed, 28 insertions(+)
diff --git a/src/backend/utils/init/globals.c b/src/backend/utils/init/globals.c
index 876cb64cf66..f5359db3656 100644
--- a/src/backend/utils/init/globals.c
+++ b/src/backend/utils/init/globals.c
@@ -147,6 +147,7 @@ int MaxBackends = 0;
/* NUMA stuff */
bool numa_buffers_interleave = false;
+bool numa_localalloc = false;
/* GUC parameters for vacuum */
int VacuumBufferUsageLimit = 2048;
diff --git a/src/backend/utils/init/miscinit.c b/src/backend/utils/init/miscinit.c
index 43b4dbccc3d..d11936691b2 100644
--- a/src/backend/utils/init/miscinit.c
+++ b/src/backend/utils/init/miscinit.c
@@ -28,6 +28,10 @@
#include <arpa/inet.h>
#include <utime.h>
+#ifdef USE_LIBNUMA
+#include <numa.h>
+#endif
+
#include "access/htup_details.h"
#include "access/parallel.h"
#include "catalog/pg_authid.h"
@@ -164,6 +168,18 @@ InitPostmasterChild(void)
(errcode_for_socket_access(),
errmsg_internal("could not set postmaster death monitoring pipe to FD_CLOEXEC mode: %m")));
#endif
+
+#ifdef USE_LIBNUMA
+ /*
+ * Set the default allocation policy to local node, where the task is
+ * executing at the time of a page fault.
+ *
+ * XXX I believe this is not necessary, now that we don't use automatic
+ * interleaving (numa_set_interleave_mask).
+ */
+ if (numa_localalloc)
+ numa_set_localalloc();
+#endif
}
/*
diff --git a/src/backend/utils/misc/guc_tables.c b/src/backend/utils/misc/guc_tables.c
index 9570087aa60..a21f20800fb 100644
--- a/src/backend/utils/misc/guc_tables.c
+++ b/src/backend/utils/misc/guc_tables.c
@@ -2126,6 +2126,16 @@ struct config_bool ConfigureNamesBool[] =
NULL, NULL, NULL
},
+ {
+ {"numa_localalloc", PGC_POSTMASTER, DEVELOPER_OPTIONS,
+ gettext_noop("Enables setting the default allocation policy to local node."),
+ gettext_noop("When enabled, allocate from the node where the task is executing."),
+ },
+ &numa_localalloc,
+ false,
+ NULL, NULL, NULL
+ },
+
{
{"sync_replication_slots", PGC_SIGHUP, REPLICATION_STANDBY,
gettext_noop("Enables a physical standby to synchronize logical failover replication slots from the primary server."),
diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h
index 014a6079af2..692871a401f 100644
--- a/src/include/miscadmin.h
+++ b/src/include/miscadmin.h
@@ -179,6 +179,7 @@ extern PGDLLIMPORT int max_worker_processes;
extern PGDLLIMPORT int max_parallel_workers;
extern PGDLLIMPORT bool numa_buffers_interleave;
+extern PGDLLIMPORT bool numa_localalloc;
extern PGDLLIMPORT int commit_timestamp_buffers;
extern PGDLLIMPORT int multixact_member_buffers;
--
2.49.0
Attachment: v2-0001-NUMA-interleaving-buffers.patch
From 1eab6285dab1fdc78d80f6054ec3278624a662f1 Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tomas@vondra.me>
Date: Tue, 6 May 2025 21:12:21 +0200
Subject: [PATCH v2 1/7] NUMA: interleaving buffers
Ensure shared buffers are allocated from all NUMA nodes, in a balanced
way, instead of just using the node where Postgres initially starts, or
where the kernel decides to migrate the page, etc. With pre-warming
performed by a single backend, this can easily result in severely
unbalanced memory distribution (with most from a single NUMA node).
The kernel would eventually move some of the memory to other nodes
(thanks to zone_reclaim), but that tends to take a long time. So this
patch improves predictability, reduces the time needed for warmup
during benchmarking, etc. It's less dependent on what the CPU
scheduler does, etc.
Furthermore, the buffers are mapped to NUMA nodes in a deterministic
way, so this also allows further improvements like backends using
buffers from the same NUMA node.
The effect is similar to
numactl --interleave=all
but there's a number of important differences.
Firstly, it's applied only to shared buffers (and also to descriptors),
not to the whole shared memory segment. It's not clear we'd want to use
interleaving for all parts, storing entries with different sizes and
life cycles (e.g. ProcArray may need different approach).
Secondly, it considers the page and block size, and makes sure not to
split a buffer on different NUMA nodes (which with the regular
interleaving is guaranteed to happen, except when using huge pages). The
patch performs "explicit" interleaving, so that buffers are not split
like this.
The patch maps both buffers and buffer descriptors, so that the buffer
and its buffer descriptor end up on the same NUMA node.
The mapping happens in larger chunks (see choose_chunk_items). This is
required to handle buffer descriptors (which are smaller than buffers),
and it should also help to reduce the number of mappings. Most NUMA
systems will use 1GB chunks, unless using very small shared buffers.
Notes:
* The feature is enabled by numa_buffers_interleave GUC (false by default)
* It's not clear we want to enable interleaving for all shared memory.
We probably want that for shared buffers, but maybe not for ProcArray
or freelists.
* Similar questions are about huge pages - in general it's a good idea,
but maybe it's not quite good for ProcArray. It's somewhat separate
from NUMA, but not entirely because NUMA works on page granularity.
PGPROC entries are ~8KB, so too large for interleaving with 4K pages,
as we don't want to split the entry to multiple nodes. But could be
done explicitly, by specifying which node to use for the pages.
* We could partition ProcArray, with one partition per NUMA node, and
then at connection time pick an entry from the local node. The process
could migrate to some other node later, especially for long-lived
connections, but there's no perfect solution. Maybe we could set
affinity to cores from the same node, or something like that?
---
src/backend/storage/buffer/buf_init.c | 384 +++++++++++++++++++++++++-
src/backend/storage/buffer/bufmgr.c | 1 +
src/backend/utils/init/globals.c | 3 +
src/backend/utils/misc/guc_tables.c | 10 +
src/bin/pgbench/pgbench.c | 67 ++---
src/include/miscadmin.h | 2 +
src/include/storage/bufmgr.h | 1 +
7 files changed, 427 insertions(+), 41 deletions(-)
diff --git a/src/backend/storage/buffer/buf_init.c b/src/backend/storage/buffer/buf_init.c
index ed1dc488a42..2ad34624c49 100644
--- a/src/backend/storage/buffer/buf_init.c
+++ b/src/backend/storage/buffer/buf_init.c
@@ -14,9 +14,17 @@
*/
#include "postgres.h"
+#ifdef USE_LIBNUMA
+#include <numa.h>
+#include <numaif.h>
+#endif
+
+#include "port/pg_numa.h"
#include "storage/aio.h"
#include "storage/buf_internals.h"
#include "storage/bufmgr.h"
+#include "storage/pg_shmem.h"
+#include "storage/proc.h"
BufferDescPadded *BufferDescriptors;
char *BufferBlocks;
@@ -25,6 +33,19 @@ WritebackContext BackendWritebackContext;
CkptSortItem *CkptBufferIds;
+static Size get_memory_page_size(void);
+static int64 choose_chunk_buffers(int NBuffers, Size mem_page_size, int num_nodes);
+static void pg_numa_interleave_memory(char *startptr, char *endptr,
+ Size mem_page_size, Size chunk_size,
+ int num_nodes);
+
+/* number of buffers allocated on the same NUMA node */
+static int64 numa_chunk_buffers = -1;
+
+/* number of NUMA nodes (as returned by numa_num_configured_nodes) */
+static int numa_nodes = -1;
+
+
/*
* Data Structures:
* buffers live in a freelist and a lookup data structure.
@@ -71,18 +92,80 @@ BufferManagerShmemInit(void)
foundDescs,
foundIOCV,
foundBufCkpt;
+ Size mem_page_size;
+ Size buffer_align;
+
+ /*
+ * XXX A bit weird. Do we need to worry about postmaster? Could this even
+ * run outside postmaster? I don't think so.
+ *
+ * XXX Another issue is we may get different values than when sizing
+ * the memory, because at that point we didn't know if we'd get huge pages,
+ * so we assumed we would. Shouldn't cause crashes, but we might allocate
+ * shared memory and then not use some of it (because of the alignment
+ * that we don't actually need). Not sure about a better way; good for now.
+ */
+ if (IsUnderPostmaster)
+ mem_page_size = pg_get_shmem_pagesize();
+ else
+ mem_page_size = get_memory_page_size();
+
+ /*
+ * With NUMA we need to ensure the buffers are properly aligned not just
+ * to PG_IO_ALIGN_SIZE, but also to memory page size, because NUMA works
+ * on page granularity, and we don't want a buffer to get split to
+ * multiple nodes (when using multiple memory pages).
+ *
+ * We also don't want to interfere with other parts of shared memory,
+ * which could easily happen with huge pages (e.g. with data stored before
+ * buffers).
+ *
+ * We do this by aligning to the larger of the two values (we know both
+ * are power-of-two values, so the larger value is automatically a
+ * multiple of the lesser one).
+ *
+ * XXX Maybe there's a way to use less alignment?
+ *
+ * XXX Maybe with (mem_page_size > PG_IO_ALIGN_SIZE), we don't need to
+ * align to mem_page_size? Especially for very large huge pages (e.g. 1GB)
+ * that doesn't seem quite worth it. Maybe we should simply align to
+ * BLCKSZ, so that buffers don't get split? Still, we might interfere with
+ * other stuff stored in shared memory that we want to allocate on a
+ * particular NUMA node (e.g. ProcArray).
+ *
+ * XXX Maybe with "too large" huge pages we should just not do this, or
+ * maybe do this only for sufficiently large areas (e.g. shared buffers,
+ * but not ProcArray).
+ */
+ buffer_align = Max(mem_page_size, PG_IO_ALIGN_SIZE);
+
+ /* one page is a multiple of the other */
+ Assert(((mem_page_size % PG_IO_ALIGN_SIZE) == 0) ||
+ ((PG_IO_ALIGN_SIZE % mem_page_size) == 0));
- /* Align descriptors to a cacheline boundary. */
+ /*
+ * Align descriptors to a cacheline boundary, and memory page.
+ *
+ * We want to distribute both to NUMA nodes, so that each buffer and its
+ * descriptor are on the same NUMA node. So we align both the same way.
+ *
+ * XXX The memory page is always larger than cacheline, so the cacheline
+ * reference is a bit unnecessary.
+ *
+ * XXX In principle we only need to do this with NUMA, otherwise we could
+ * still align just to cacheline, as before.
+ */
BufferDescriptors = (BufferDescPadded *)
- ShmemInitStruct("Buffer Descriptors",
- NBuffers * sizeof(BufferDescPadded),
- &foundDescs);
+ TYPEALIGN(buffer_align,
+ ShmemInitStruct("Buffer Descriptors",
+ NBuffers * sizeof(BufferDescPadded) + buffer_align,
+ &foundDescs));
/* Align buffer pool on IO page size boundary. */
BufferBlocks = (char *)
- TYPEALIGN(PG_IO_ALIGN_SIZE,
+ TYPEALIGN(buffer_align,
ShmemInitStruct("Buffer Blocks",
- NBuffers * (Size) BLCKSZ + PG_IO_ALIGN_SIZE,
+ NBuffers * (Size) BLCKSZ + buffer_align,
&foundBufs));
/* Align condition variables to cacheline boundary. */
@@ -112,6 +195,63 @@ BufferManagerShmemInit(void)
{
int i;
+ /*
+ * Assign chunks of buffers and buffer descriptors to the available
+ * NUMA nodes. We can't use the regular interleaving, because with
+ * regular memory pages (smaller than BLCKSZ) we'd split all buffers
+ * to multiple NUMA nodes. And we don't want that.
+ *
+ * But even with huge pages it seems like a good idea to not have
+ * mapping for each page.
+ *
+ * So we always assign a larger contiguous chunk of buffers to the
+ * same NUMA node, as calculated by choose_chunk_buffers(). We try to
+ * keep the chunks large enough to work both for buffers and buffer
+ * descriptors, but not too large. See the comments at
+ * choose_chunk_buffers() for details.
+ *
+ * Thanks to the earlier alignment (to memory page etc.), we know the
+ * buffers won't get split, etc.
+ *
+ * This also makes it easier / straightforward to calculate which NUMA
+ * node a buffer belongs to (it's a matter of divide + mod). See
+ * BufferGetNode().
+ */
+ if (numa_buffers_interleave)
+ {
+ char *startptr,
+ *endptr;
+ Size chunk_size;
+
+ numa_nodes = numa_num_configured_nodes();
+
+ numa_chunk_buffers
+ = choose_chunk_buffers(NBuffers, mem_page_size, numa_nodes);
+
+ elog(LOG, "BufferManagerShmemInit num_nodes %d chunk_buffers %ld",
+ numa_nodes, numa_chunk_buffers);
+
+ /* first map buffers */
+ startptr = BufferBlocks;
+ endptr = startptr + ((Size) NBuffers) * BLCKSZ;
+ chunk_size = (numa_chunk_buffers * BLCKSZ);
+
+ pg_numa_interleave_memory(startptr, endptr,
+ mem_page_size,
+ chunk_size,
+ numa_nodes);
+
+ /* now do the same for buffer descriptors */
+ startptr = (char *) BufferDescriptors;
+ endptr = startptr + ((Size) NBuffers) * sizeof(BufferDescPadded);
+ chunk_size = (numa_chunk_buffers * sizeof(BufferDescPadded));
+
+ pg_numa_interleave_memory(startptr, endptr,
+ mem_page_size,
+ chunk_size,
+ numa_nodes);
+ }
+
/*
* Initialize all the buffer headers.
*/
@@ -144,6 +284,11 @@ BufferManagerShmemInit(void)
GetBufferDescriptor(NBuffers - 1)->freeNext = FREENEXT_END_OF_LIST;
}
+ /*
+ * As this point we have all the buffers in a single long freelist. With
+ * freelist partitioning we rebuild them in StrategyInitialize.
+ */
+
/* Init other shared buffer-management stuff */
StrategyInitialize(!foundDescs);
@@ -152,24 +297,72 @@ BufferManagerShmemInit(void)
&backend_flush_after);
}
+/*
+ * Determine the size of memory page.
+ *
+ * XXX This is a bit tricky, because the result depends on when we call
+ * this. Before the allocation we don't know if we succeed in allocating huge
+ * pages - but we have to size everything for the chance that we will. And then
+ * if the huge pages fail (with 'huge_pages=try'), we'll use the regular memory
+ * pages. But at that point we can't adjust the sizing.
+ *
+ * XXX Maybe with huge_pages=try we should do the sizing twice - first with
+ * huge pages, and if that fails, then without them. But not for this patch.
+ * Up to this point there was no such dependency on huge pages.
+ */
+static Size
+get_memory_page_size(void)
+{
+ Size os_page_size;
+ Size huge_page_size;
+
+#ifdef WIN32
+ SYSTEM_INFO sysinfo;
+
+ GetSystemInfo(&sysinfo);
+ os_page_size = sysinfo.dwPageSize;
+#else
+ os_page_size = sysconf(_SC_PAGESIZE);
+#endif
+
+ /* assume huge pages get used, unless HUGE_PAGES_OFF */
+ if (huge_pages_status != HUGE_PAGES_OFF)
+ GetHugePageSize(&huge_page_size, NULL);
+ else
+ huge_page_size = 0;
+
+ return Max(os_page_size, huge_page_size);
+}
+
/*
* BufferManagerShmemSize
*
* compute the size of shared memory for the buffer pool including
* data pages, buffer descriptors, hash tables, etc.
+ *
+ * XXX Called before allocation, so we don't know if huge pages get used yet.
+ * So we need to assume huge pages get used, and use get_memory_page_size()
+ * to calculate the largest possible memory page.
*/
Size
BufferManagerShmemSize(void)
{
Size size = 0;
+ Size mem_page_size;
+
+ /* XXX why does IsUnderPostmaster matter? */
+ if (IsUnderPostmaster)
+ mem_page_size = pg_get_shmem_pagesize();
+ else
+ mem_page_size = get_memory_page_size();
/* size of buffer descriptors */
size = add_size(size, mul_size(NBuffers, sizeof(BufferDescPadded)));
/* to allow aligning buffer descriptors */
- size = add_size(size, PG_CACHE_LINE_SIZE);
+ size = add_size(size, Max(mem_page_size, PG_IO_ALIGN_SIZE));
/* size of data pages, plus alignment padding */
- size = add_size(size, PG_IO_ALIGN_SIZE);
+ size = add_size(size, Max(mem_page_size, PG_IO_ALIGN_SIZE));
size = add_size(size, mul_size(NBuffers, BLCKSZ));
/* size of stuff controlled by freelist.c */
@@ -186,3 +379,178 @@ BufferManagerShmemSize(void)
return size;
}
+
+/*
+ * choose_chunk_buffers
+ * choose the number of buffers allocated to a NUMA node at once
+ *
+ * We don't map shared buffers to NUMA nodes one by one, but in larger chunks.
+ * This is both for efficiency reasons (fewer mappings), and also because we
+ * want to map buffer descriptors too - and descriptors are much smaller. So
+ * we pick a number that's high enough for descriptors to use whole pages.
+ *
+ * We also want to keep buffers somewhat evenly distributed on nodes, with
+ * about NBuffers/nodes per node. So we don't use chunks larger than this,
+ * to keep it as fair as possible (the chunk size is a possible difference
+ * between memory allocated to different NUMA nodes).
+ *
+ * It's possible shared buffers are so small this is not possible (i.e.
+ * it's less than chunk_size). But sensible NUMA systems will use a lot
+ * of memory, so this is unlikely.
+ *
+ * We simply print a warning about the misbalance, and that's it.
+ *
+ * XXX It'd be good to ensure the chunk size is a power-of-2, because then
+ * we could calculate the NUMA node simply by shift/modulo, while now we
+ * have to do a division. But we don't know how many buffers and buffer
+ * descriptors fit into a memory page. It may not be a power-of-2.
+ */
+static int64
+choose_chunk_buffers(int NBuffers, Size mem_page_size, int num_nodes)
+{
+ int64 num_items;
+ int64 max_items;
+
+ /* make sure the chunks will align nicely */
+ Assert(BLCKSZ % sizeof(BufferDescPadded) == 0);
+ Assert(mem_page_size % sizeof(BufferDescPadded) == 0);
+ Assert(((BLCKSZ % mem_page_size) == 0) || ((mem_page_size % BLCKSZ) == 0));
+
+ /*
+ * The minimum number of items to fill a memory page with descriptors and
+ * blocks. NUMA allocates memory in pages, and we need to do that for
+ * both buffers and descriptors.
+ *
+ * In practice the BLCKSZ doesn't really matter, because it's much larger
+ * than BufferDescPadded, so the result is determined by the buffer descriptors.
+ * But it's clearer this way.
+ */
+ num_items = Max(mem_page_size / sizeof(BufferDescPadded),
+ mem_page_size / BLCKSZ);
+
+ /*
+ * We shouldn't use chunks larger than NBuffers/num_nodes, because with
+ * larger chunks the last NUMA node would end up with much less memory (or
+ * no memory at all).
+ */
+ max_items = (NBuffers / num_nodes);
+
+ /*
+ * Did we already exceed the maximum desirable chunk size? That is, will
+ * the last node get less than one whole chunk (or no memory at all)?
+ */
+ if (num_items > max_items)
+ elog(WARNING, "choose_chunk_buffers: chunk items exceeds max (%ld > %ld)",
+ num_items, max_items);
+
+ /* grow the chunk size until we hit the max limit. */
+ while (2 * num_items <= max_items)
+ num_items *= 2;
+
+ /*
+ * XXX It's not difficult to construct cases where we end up with not
+ * quite balanced distribution. For example, with shared_buffers=10GB and
+ * 4 NUMA nodes, we end up with 2GB chunks, which means the first node
+ * gets 4GB, and the three other nodes get 2GB each.
+ *
+ * We could be smarter, and try to get more balanced distribution. We
+ * could simply reduce max_items e.g. to
+ *
+ * max_items = (NBuffers / num_nodes) / 4;
+ *
+ * in which cases we'd end up with 512MB chunks, and each nodes would get
+ * the same 2.5GB chunk. It may not always work out this nicely, but it's
+ * better than with (NBuffers / num_nodes).
+ *
+ * Alternatively, we could "backtrack" - try with the large max_items,
+ * check how balanced it is, and if it's too imbalanced, try with a
+ * smaller one.
+ *
+ * We however want a simple scheme.
+ */
+
+ return num_items;
+}
+
+/*
+ * Calculate the NUMA node for a given buffer.
+ */
+int
+BufferGetNode(Buffer buffer)
+{
+ /* not NUMA interleaving */
+ if (numa_chunk_buffers == -1)
+ return -1;
+
+ return (buffer / numa_chunk_buffers) % numa_nodes;
+}
+
+/*
+ * pg_numa_interleave_memory
+ * move memory to different NUMA nodes in larger chunks
+ *
+ * startptr - start of the region (should be aligned to page size)
+ * endptr - end of the region (doesn't need to be aligned)
+ * mem_page_size - size of the memory page
+ * chunk_size - size of the chunk to move to a single node (should be a
+ * multiple of page size)
+ * num_nodes - number of nodes to allocate memory to
+ *
+ * XXX Maybe this should use numa_tonode_memory and numa_police_memory instead?
+ * That might be more efficient than numa_move_pages, as it works on larger
+ * chunks of memory, not individual system pages, I think.
+ *
+ * XXX The "interleave" name is not quite accurate, I guess.
+ */
+static void
+pg_numa_interleave_memory(char *startptr, char *endptr,
+ Size mem_page_size, Size chunk_size,
+ int num_nodes)
+{
+ volatile uint64 touch pg_attribute_unused();
+ char *ptr = startptr;
+
+ /* chunk size has to be a multiple of memory page */
+ Assert((chunk_size % mem_page_size) == 0);
+
+ /*
+ * Walk the memory pages in the range, and determine the node for each
+ * one. We use numa_tonode_memory(), because then we can move a whole
+ * memory range to the node, we don't need to worry about individual pages
+ * like with numa_move_pages().
+ */
+ while (ptr < endptr)
+ {
+ /* We may have an incomplete chunk at the end. */
+ Size sz = Min(chunk_size, (endptr - ptr));
+
+ /*
+ * What NUMA node does this range belong to? Each chunk should go to
+ * the same NUMA node, in a round-robin manner.
+ */
+ int node = ((ptr - startptr) / chunk_size) % num_nodes;
+
+ /*
+ * Set the NUMA node for this chunk. The chunk should start at a memory
+ * page boundary and span whole memory pages, thanks to the buffer_align
+ * earlier.
+ */
+ Assert((int64) ptr % mem_page_size == 0);
+ Assert((sz % mem_page_size) == 0);
+
+ /*
+ * XXX no return value, to make this fail on error, has to use
+ * numa_set_strict
+ *
+ * XXX Should we still touch the memory first, like with numa_move_pages,
+ * or is that not necessary?
+ */
+ numa_tonode_memory(ptr, sz, node);
+
+ ptr += sz;
+ }
+
+ /* should have processed all chunks */
+ Assert(ptr == endptr);
+}
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 94db3e7c976..5922689fe5d 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -685,6 +685,7 @@ ReadRecentBuffer(RelFileLocator rlocator, ForkNumber forkNum, BlockNumber blockN
BufferDesc *bufHdr;
BufferTag tag;
uint32 buf_state;
+
Assert(BufferIsValid(recent_buffer));
ResourceOwnerEnlarge(CurrentResourceOwner);
diff --git a/src/backend/utils/init/globals.c b/src/backend/utils/init/globals.c
index d31cb45a058..876cb64cf66 100644
--- a/src/backend/utils/init/globals.c
+++ b/src/backend/utils/init/globals.c
@@ -145,6 +145,9 @@ int max_worker_processes = 8;
int max_parallel_workers = 8;
int MaxBackends = 0;
+/* NUMA stuff */
+bool numa_buffers_interleave = false;
+
/* GUC parameters for vacuum */
int VacuumBufferUsageLimit = 2048;
diff --git a/src/backend/utils/misc/guc_tables.c b/src/backend/utils/misc/guc_tables.c
index d14b1678e7f..9570087aa60 100644
--- a/src/backend/utils/misc/guc_tables.c
+++ b/src/backend/utils/misc/guc_tables.c
@@ -2116,6 +2116,16 @@ struct config_bool ConfigureNamesBool[] =
NULL, NULL, NULL
},
+ {
+ {"numa_buffers_interleave", PGC_POSTMASTER, DEVELOPER_OPTIONS,
+ gettext_noop("Enables NUMA interleaving of shared buffers."),
+ gettext_noop("When enabled, the buffers in shared memory are interleaved to all NUMA nodes."),
+ },
+ &numa_buffers_interleave,
+ false,
+ NULL, NULL, NULL
+ },
+
{
{"sync_replication_slots", PGC_SIGHUP, REPLICATION_STANDBY,
gettext_noop("Enables a physical standby to synchronize logical failover replication slots from the primary server."),
diff --git a/src/bin/pgbench/pgbench.c b/src/bin/pgbench/pgbench.c
index 69b6a877dc9..c07de903f76 100644
--- a/src/bin/pgbench/pgbench.c
+++ b/src/bin/pgbench/pgbench.c
@@ -305,7 +305,7 @@ static const char *progname;
#define CPU_PINNING_RANDOM 1
#define CPU_PINNING_COLOCATED 2
-static int pinning_mode = CPU_PINNING_NONE;
+static int pinning_mode = CPU_PINNING_NONE;
#define WSEP '@' /* weight separator */
@@ -874,20 +874,20 @@ static bool socket_has_input(socket_set *sa, int fd, int idx);
*/
typedef struct cpu_generator_state
{
- int ncpus; /* number of CPUs available */
- int nitems; /* number of items in the queue */
- int *nthreads; /* number of threads for each CPU */
- int *nclients; /* number of processes for each CPU */
- int *items; /* queue of CPUs to pick from */
-} cpu_generator_state;
+ int ncpus; /* number of CPUs available */
+ int nitems; /* number of items in the queue */
+ int *nthreads; /* number of threads for each CPU */
+ int *nclients; /* number of processes for each CPU */
+ int *items; /* queue of CPUs to pick from */
+} cpu_generator_state;
static cpu_generator_state cpu_generator_init(int ncpus);
-static void cpu_generator_refill(cpu_generator_state *state);
-static void cpu_generator_reset(cpu_generator_state *state);
-static int cpu_generator_thread(cpu_generator_state *state);
-static int cpu_generator_client(cpu_generator_state *state, int thread_cpu);
-static void cpu_generator_print(cpu_generator_state *state);
-static bool cpu_generator_check(cpu_generator_state *state);
+static void cpu_generator_refill(cpu_generator_state * state);
+static void cpu_generator_reset(cpu_generator_state * state);
+static int cpu_generator_thread(cpu_generator_state * state);
+static int cpu_generator_client(cpu_generator_state * state, int thread_cpu);
+static void cpu_generator_print(cpu_generator_state * state);
+static bool cpu_generator_check(cpu_generator_state * state);
static void reset_pinning(TState *threads, int nthreads);
@@ -7422,7 +7422,7 @@ main(int argc, char **argv)
/* try to assign threads/clients to CPUs */
if (pinning_mode != CPU_PINNING_NONE)
{
- int nprocs = get_nprocs();
+ int nprocs = get_nprocs();
cpu_generator_state state = cpu_generator_init(nprocs);
retry:
@@ -7433,6 +7433,7 @@ retry:
for (i = 0; i < nthreads; i++)
{
TState *thread = &threads[i];
+
thread->cpu = cpu_generator_thread(&state);
}
@@ -7444,7 +7445,7 @@ retry:
while (true)
{
/* did we find any unassigned backend? */
- bool found = false;
+ bool found = false;
for (i = 0; i < nthreads; i++)
{
@@ -7678,10 +7679,10 @@ threadRun(void *arg)
/* determine PID of the backend, pin it to the same CPU */
for (int i = 0; i < nstate; i++)
{
- char *pid_str;
- pid_t pid;
+ char *pid_str;
+ pid_t pid;
- PGresult *res = PQexec(state[i].con, "select pg_backend_pid()");
+ PGresult *res = PQexec(state[i].con, "select pg_backend_pid()");
if (PQresultStatus(res) != PGRES_TUPLES_OK)
pg_fatal("could not determine PID of the backend for client %d",
@@ -8184,7 +8185,7 @@ cpu_generator_init(int ncpus)
{
struct timeval tv;
- cpu_generator_state state;
+ cpu_generator_state state;
state.ncpus = ncpus;
@@ -8207,7 +8208,7 @@ cpu_generator_init(int ncpus)
}
static void
-cpu_generator_refill(cpu_generator_state *state)
+cpu_generator_refill(cpu_generator_state * state)
{
struct timeval tv;
@@ -8223,7 +8224,7 @@ cpu_generator_refill(cpu_generator_state *state)
}
static void
-cpu_generator_reset(cpu_generator_state *state)
+cpu_generator_reset(cpu_generator_state * state)
{
state->nitems = 0;
cpu_generator_refill(state);
@@ -8236,15 +8237,15 @@ cpu_generator_reset(cpu_generator_state *state)
}
static int
-cpu_generator_thread(cpu_generator_state *state)
+cpu_generator_thread(cpu_generator_state * state)
{
if (state->nitems == 0)
cpu_generator_refill(state);
while (true)
{
- int idx = lrand48() % state->nitems;
- int cpu = state->items[idx];
+ int idx = lrand48() % state->nitems;
+ int cpu = state->items[idx];
state->items[idx] = state->items[state->nitems - 1];
state->nitems--;
@@ -8256,10 +8257,10 @@ cpu_generator_thread(cpu_generator_state *state)
}
static int
-cpu_generator_client(cpu_generator_state *state, int thread_cpu)
+cpu_generator_client(cpu_generator_state * state, int thread_cpu)
{
- int min_clients;
- bool has_valid_cpus = false;
+ int min_clients;
+ bool has_valid_cpus = false;
for (int i = 0; i < state->nitems; i++)
{
@@ -8284,8 +8285,8 @@ cpu_generator_client(cpu_generator_state *state, int thread_cpu)
while (true)
{
- int idx = lrand48() % state->nitems;
- int cpu = state->items[idx];
+ int idx = lrand48() % state->nitems;
+ int cpu = state->items[idx];
if (cpu == thread_cpu)
continue;
@@ -8303,7 +8304,7 @@ cpu_generator_client(cpu_generator_state *state, int thread_cpu)
}
static void
-cpu_generator_print(cpu_generator_state *state)
+cpu_generator_print(cpu_generator_state * state)
{
for (int i = 0; i < state->ncpus; i++)
{
@@ -8312,10 +8313,10 @@ cpu_generator_print(cpu_generator_state *state)
}
static bool
-cpu_generator_check(cpu_generator_state *state)
+cpu_generator_check(cpu_generator_state * state)
{
- int min_count = INT_MAX,
- max_count = 0;
+ int min_count = INT_MAX,
+ max_count = 0;
for (int i = 0; i < state->ncpus; i++)
{
diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h
index 1bef98471c3..014a6079af2 100644
--- a/src/include/miscadmin.h
+++ b/src/include/miscadmin.h
@@ -178,6 +178,8 @@ extern PGDLLIMPORT int MaxConnections;
extern PGDLLIMPORT int max_worker_processes;
extern PGDLLIMPORT int max_parallel_workers;
+extern PGDLLIMPORT bool numa_buffers_interleave;
+
extern PGDLLIMPORT int commit_timestamp_buffers;
extern PGDLLIMPORT int multixact_member_buffers;
extern PGDLLIMPORT int multixact_offset_buffers;
diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index 41fdc1e7693..c257c8a1c20 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -319,6 +319,7 @@ extern void EvictRelUnpinnedBuffers(Relation rel,
/* in buf_init.c */
extern void BufferManagerShmemInit(void);
extern Size BufferManagerShmemSize(void);
+extern int BufferGetNode(Buffer buffer);
/* in localbuf.c */
extern void AtProcExit_LocalBuffers(void);
--
2.49.0
On 7/4/25 20:12, Tomas Vondra wrote:
On 7/4/25 13:05, Jakub Wartak wrote:
...
8. v1-0005 2x + /* if (numa_procs_interleave) */
Ha! it's a TRAP! I've uncommented it because I wanted to try it out
without it (just by setting GUC off) , but "MyProc->sema" is NULL :2025-07-04 12:31:08.103 CEST [28754] LOG: starting PostgreSQL
19devel on x86_64-linux, compiled by gcc-12.2.0, 64-bit
[..]
2025-07-04 12:31:08.109 CEST [28754] LOG: io worker (PID 28755)
was terminated by signal 11: Segmentation fault
2025-07-04 12:31:08.109 CEST [28754] LOG: terminating any other
active server processes
2025-07-04 12:31:08.114 CEST [28754] LOG: shutting down because
"restart_after_crash" is off
2025-07-04 12:31:08.116 CEST [28754] LOG: database system is shut down
[New LWP 28755]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
Core was generated by `postgres: io worker '.
Program terminated with signal SIGSEGV, Segmentation fault.
#0 __new_sem_wait_fast (definitive_result=1, sem=sem@entry=0x0)
at ./nptl/sem_waitcommon.c:136
136 ./nptl/sem_waitcommon.c: No such file or directory.
(gdb) where
#0 __new_sem_wait_fast (definitive_result=1, sem=sem@entry=0x0)
at ./nptl/sem_waitcommon.c:136
#1 __new_sem_trywait (sem=sem@entry=0x0) at ./nptl/sem_wait.c:81
#2 0x00005561918e0cac in PGSemaphoreReset (sema=0x0) at
../src/backend/port/posix_sema.c:302
#3 0x0000556191970553 in InitAuxiliaryProcess () at
../src/backend/storage/lmgr/proc.c:992
#4 0x00005561918e51a2 in AuxiliaryProcessMainCommon () at
../src/backend/postmaster/auxprocess.c:65
#5 0x0000556191940676 in IoWorkerMain (startup_data=<optimized
out>, startup_data_len=<optimized out>) at
../src/backend/storage/aio/method_worker.c:393
#6 0x00005561918e8163 in postmaster_child_launch
(child_type=child_type@entry=B_IO_WORKER, child_slot=20086,
startup_data=startup_data@entry=0x0,
startup_data_len=startup_data_len@entry=0,
client_sock=client_sock@entry=0x0) at
../src/backend/postmaster/launch_backend.c:290
#7 0x00005561918ea09a in StartChildProcess
(type=type@entry=B_IO_WORKER) at
../src/backend/postmaster/postmaster.c:3973
#8 0x00005561918ea308 in maybe_adjust_io_workers () at
../src/backend/postmaster/postmaster.c:4404
[..]
(gdb) print *MyProc->sem
Cannot access memory at address 0x0
Yeah, good catch. I'll look into that next week.
I've been unable to reproduce this issue, but I'm not sure what settings
you actually used for this instance. Can you give me more details how to
reproduce this?
regards
--
Tomas Vondra
Hi,
On 2025-07-17 23:11:16 +0200, Tomas Vondra wrote:
Here's a v2 of the patch series, with a couple changes:
Not a deep look at the code, just a quick reply.
* I changed the freelist partitioning scheme a little bit, based on the
discussion in this thread. Instead of having a single "partition" per
NUMA node, there's not a minimum number of partitions (set to 4). So
I assume s/not/now/?
* There's now a patch partitioning clocksweep, using the same scheme as
the freelists.
Nice!
I came to the conclusion it doesn't make much sense to partition these
things differently - I can't think of a reason why that would be
advantageous, and it makes it easier to reason about.
Agreed.
The clocksweep partitioning is somewhat harder, because it affects
BgBufferSync() and related code. With the partitioning we now have
multiple "clock hands" for different ranges of buffers, and the clock
sweep needs to consider that. I modified BgBufferSync to simply loop
through the ClockSweep partitions, and do a small cleanup for each.
That probably makes sense for now. It might need a bit of a larger adjustment
at some point, but ...
* This new freelist/clocksweep partitioning scheme is however harder to
disable. I now realize the GUC may not quite do the trick, and there isn't
even a GUC for the clocksweep. I need to think about this, but I'm not even
sure how feasible it'd be to have two separate GUCs (because of how
these two pieces are intertwined). For now if you want to test without
the partitioning, you need to skip the patch.
I think it's totally fair to enable/disable them at the same time. They're so
closely related, that I don't think it really makes sense to measure them
separately.
I did some quick perf testing on my old xeon machine (2 NUMA nodes), and
the results are encouraging. For a read-only pgbench (2x shared buffers,
within RAM), I saw an increase from 1.1M tps to 1.3M. Not crazy, but not
bad considering the patch is more about consistency than raw throughput.
Personally I think a 1.18x improvement on a relatively small NUMA machine is
really rather awesome.
For a read-write pgbench I however saw some strange drops/increases of
throughput. I suspect this might be due to some thinko in the clocksweep
partitioning, but I'll need to take a closer look.
Was that with pinning etc enabled or not?
From c4d51ab87b92f9900e37d42cf74980e87b648a56 Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tomas@vondra.me>
Date: Sun, 8 Jun 2025 18:53:12 +0200
Subject: [PATCH v2 5/7] NUMA: clockweep partitioning
@@ -475,13 +525,17 @@ StrategyGetBuffer(BufferAccessStrategy strategy, uint32 *buf_state, bool *from_r
 /*
 * Nothing on the freelist, so run the "clock sweep" algorithm
 *
- * XXX Should we also make this NUMA-aware, to only access buffers from
- * the same NUMA node? That'd probably mean we need to make the clock
- * sweep NUMA-aware, perhaps by having multiple clock sweeps, each for a
- * subset of buffers. But that also means each process could "sweep" only
- * a fraction of buffers, even if the other buffers are better candidates
- * for eviction. Would that also mean we'd have multiple bgwriters, one
- * for each node, or would one bgwriter handle all of that?
+ * XXX Note that ClockSweepTick() is NUMA-aware, i.e. it only looks at
+ * buffers from a single partition, aligned with the NUMA node. That
+ * means it only accesses buffers from the same NUMA node.
+ *
+ * XXX That also means each process "sweeps" only a fraction of buffers,
+ * even if the other buffers are better candidates for eviction. Maybe
+ * there should be some logic to "steal" buffers from other freelists
+ * or other nodes?
I think we *definitely* need "stealing" from other clock sweeps, whenever
there's a meaningful imbalance between the different sweeps.
I don't think we need to be overly precise about it, a small imbalance won't
have that much of an effect. But clearly it doesn't make sense to say that one
backend can only fill buffers in the current partition, that'd lead to massive
performance issues in a lot of workloads.
The hardest thing probably is to make the logic for when to check foreign
clock sweeps cheap enough.
One way would be to do it whenever a sweep wraps around, that'd probably
amortize the cost sufficiently, and I don't think it'd be too imprecise, as
we'd have processed that set of buffers in a row without partitioning as
well. But it'd probably be too coarse when determining for how long to use a
foreign sweep instance. But we probably could address that by rechecking the
balance more frequently when using a foreign partition.
Another way would be to have bgwriter manage this. Whenever it detects that
one ring is too far ahead, it could set a "avoid this partition" bit, which
would trigger backends that natively use that partition to switch to foreign
partitions that don't currently have that bit set. I suspect there's a
problem with that approach though, I worry that the amount of time that
bgwriter spends in BgBufferSync() may sometimes be too long, leading to too
much imbalance.
Greetings,
Andres Freund
On 7/18/25 18:46, Andres Freund wrote:
Hi,
On 2025-07-17 23:11:16 +0200, Tomas Vondra wrote:
Here's a v2 of the patch series, with a couple changes:
Not a deep look at the code, just a quick reply.
* I changed the freelist partitioning scheme a little bit, based on the
discussion in this thread. Instead of having a single "partition" per
NUMA node, there's not a minimum number of partitions (set to 4). So
I assume s/not/now/?
Yes.
* There's now a patch partitioning clocksweep, using the same scheme as
the freelists.
Nice!
I came to the conclusion it doesn't make much sense to partition these
things differently - I can't think of a reason why that would be
advantageous, and it makes it easier to reason about.
Agreed.
The clocksweep partitioning is somewhat harder, because it affects
BgBufferSync() and related code. With the partitioning we now have
multiple "clock hands" for different ranges of buffers, and the clock
sweep needs to consider that. I modified BgBufferSync to simply loop
through the ClockSweep partitions, and do a small cleanup for each.
That probably makes sense for now. It might need a bit of a larger
adjustment at some point, but ...
I couldn't think of something fundamentally better and not too complex.
I suspect we might want to use multiple bgwriters in the future, and
this scheme seems to be reasonably well suited for that too.
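For illustration, here's a toy sketch of the loop shape being discussed (not
the actual patch code - the SweepPartition struct and the function names are
invented for the example, and the per-buffer cleanup is just a placeholder):

/*
 * Toy sketch of a bgwriter pass over per-partition clock-sweep state.
 * Not the patch code; all names here are made up.
 */
typedef struct SweepPartition
{
    int     first_buffer;       /* first buffer id in this partition */
    int     num_buffers;        /* number of buffers in this partition */
    int     next_victim;        /* clock hand, relative to first_buffer */
    long    complete_passes;    /* how many times the hand wrapped */
} SweepPartition;

static void
sweep_partition_cleanup(SweepPartition *p, int max_scan)
{
    for (int scanned = 0; scanned < max_scan; scanned++)
    {
        int     buf = p->first_buffer + p->next_victim;

        /* ... inspect / write out buffer 'buf' here ... */
        (void) buf;

        if (++p->next_victim >= p->num_buffers)
        {
            p->next_victim = 0;
            p->complete_passes++;
        }
    }
}

static void
bgwriter_sync_all_partitions(SweepPartition *parts, int nparts, int max_scan)
{
    /* a small cleanup for each partition, using its own clock hand */
    for (int i = 0; i < nparts; i++)
        sweep_partition_cleanup(&parts[i], max_scan);
}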
I'm also thinking about having some sort of "unified" partitioning
scheme for all the places partitioning shared buffers. Right now each of
the places does it on its own, i.e. buff_init, freelist and clocksweep
all have their code splitting NBuffers into partitions. And it should
align. Because what would be the benefit if it didn't? But I guess
having three variants of the same code seems a bit pointless.
I think buff_init should build a common definition of buffer partitions,
and the remaining parts should use that as the source of truth ...
* This new freelist/clocksweep partitioning scheme is however harder to
disable. I now realize the GUC may not quite do the trick, and there isn't
even a GUC for the clocksweep. I need to think about this, but I'm not even
sure how feasible it'd be to have two separate GUCs (because of how
these two pieces are intertwined). For now if you want to test without
the partitioning, you need to skip the patch.
I think it's totally fair to enable/disable them at the same time. They're so
closely related, that I don't think it really makes sense to measure them
separately.
Yeah, that's a fair point.
I did some quick perf testing on my old xeon machine (2 NUMA nodes), and
the results are encouraging. For a read-only pgbench (2x shared buffers,
within RAM), I saw an increase from 1.1M tps to 1.3M. Not crazy, but not
bad considering the patch is more about consistency than raw throughput.
Personally I think a 1.18x improvement on a relatively small NUMA machine is
really rather awesome.
True, but I want to stress it's just one quick (& simple) test. Much
more testing is needed before I can make reliable claims.
For a read-write pgbench I however saw some strange drops/increases of
throughput. I suspect this might be due to some thinko in the clocksweep
partitioning, but I'll need to take a closer look.
Was that with pinning etc enabled or not?
IIRC it was with everything enabled, except for numa_procs_pin (which
pins backend to NUMA node). I found that to actually harm performance in
some of the tests (even just read-only ones), resulting in uneven usage
of cores and lower throughput.
From c4d51ab87b92f9900e37d42cf74980e87b648a56 Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tomas@vondra.me>
Date: Sun, 8 Jun 2025 18:53:12 +0200
Subject: [PATCH v2 5/7] NUMA: clockweep partitioning
@@ -475,13 +525,17 @@ StrategyGetBuffer(BufferAccessStrategy strategy, uint32 *buf_state, bool *from_r
 /*
 * Nothing on the freelist, so run the "clock sweep" algorithm
 *
- * XXX Should we also make this NUMA-aware, to only access buffers from
- * the same NUMA node? That'd probably mean we need to make the clock
- * sweep NUMA-aware, perhaps by having multiple clock sweeps, each for a
- * subset of buffers. But that also means each process could "sweep" only
- * a fraction of buffers, even if the other buffers are better candidates
- * for eviction. Would that also mean we'd have multiple bgwriters, one
- * for each node, or would one bgwriter handle all of that?
+ * XXX Note that ClockSweepTick() is NUMA-aware, i.e. it only looks at
+ * buffers from a single partition, aligned with the NUMA node. That
+ * means it only accesses buffers from the same NUMA node.
+ *
+ * XXX That also means each process "sweeps" only a fraction of buffers,
+ * even if the other buffers are better candidates for eviction. Maybe
+ * there should be some logic to "steal" buffers from other freelists
+ * or other nodes?
I think we *definitely* need "stealing" from other clock sweeps, whenever
there's a meaningful imbalance between the different sweeps.
I don't think we need to be overly precise about it, a small imbalance won't
have that much of an effect. But clearly it doesn't make sense to say that one
backend can only fill buffers in the current partition, that'd lead to massive
performance issues in a lot of workloads.
Agreed.
The hardest thing probably is to make the logic for when to check foreign
clock sweeps cheap enough.
One way would be to do it whenever a sweep wraps around, that'd probably
amortize the cost sufficiently, and I don't think it'd be too imprecise, as
we'd have processed that set of buffers in a row without partitioning as
well. But it'd probably be too coarse when determining for how long to use a
foreign sweep instance. But we probably could address that by rechecking the
balance more frequently when using a foreign partition.
What do you mean by "it"? What would happen after a sweep wraps around?
Another way would be to have bgwriter manage this. Whenever it detects that
one ring is too far ahead, it could set a "avoid this partition" bit, which
would trigger backends that natively use that partition to switch to foreign
partitions that don't currently have that bit set. I suspect there's a
problem with that approach though, I worry that the amount of time that
bgwriter spends in BgBufferSync() may sometimes be too long, leading to too
much imbalance.
I'm afraid having hard "avoid" flags would lead to sudden and unexpected
changes in performance as we enable/disable partitions. I think a good
solution should "smooth it out" somehow, e.g. by not having a true/false
flag, but having some sort of "preference" factor with values between
(0.0, 1.0) which says how much we should use that partition.
I was imagining something like this:
Say we know the number of buffers allocated for each partition (in the
last round), and we (or rather the BgBufferSync) calculate:
coefficient = 1.0 - (nallocated_partition / nallocated)
and then use that to "correct" which partition to allocate buffers from.
Or maybe just watch how far from the "fair share" we were in the last
interval, and gradually increase/decrease the "partition preference"
which would say how often we need to "steal" from other partitions.
E.g. we find nallocated_partition is 2x the fair share, i.e.
nallocated_partition / (nallocated / nparts) = 2.0
Then we say 25% of the time look at some other partition, to "cut" the
imbalance in half. And then repeat that in the next cycle, etc.
So a process would look at it's "home partition" by default, but it's
"roll a dice" first and if above the calculated probability it'd pick
some other partition instead (this would need to be done so that it gets
balanced overall).
If the bgwriter interval is too long, maybe the recalculation could be
triggered regularly after any of the clocksweeps wraps around, or after
some number of allocations, or something like that.
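To make the "roll a dice" idea above concrete, here's a toy sketch (not patch
code; it assumes some component, e.g. bgwriter, publishes per-partition
allocation counts for the last interval, and all names here are invented):

/*
 * Toy sketch of the preference-based partition choice. A partition at 2x
 * the fair share redirects 25% of its allocations elsewhere, cutting the
 * excess over the fair share roughly in half per interval.
 */
#include <stdlib.h>

/* probability of using a foreign partition, aiming to halve the excess */
static double
steal_probability(long nalloc_home, long nalloc_total, int nparts)
{
    double  fair = (double) nalloc_total / nparts;
    double  ratio;

    if (fair <= 0 || nalloc_home <= fair)
        return 0.0;             /* at or below fair share: stay home */

    ratio = nalloc_home / fair; /* e.g. 2.0 means 2x the fair share */

    /* 2x fair share => 25% of allocations go to other partitions */
    return (ratio - 1.0) / (2.0 * ratio);
}

/* pick the partition to allocate from, given last-interval counters */
static int
choose_partition(int home, const long *nalloc, long total, int nparts)
{
    double  p = steal_probability(nalloc[home], total, nparts);

    if ((double) rand() / RAND_MAX >= p)
        return home;            /* most of the time, use the home partition */

    /* otherwise pick some other partition, preferring ones at/below fair share */
    for (int tries = 0; tries < 2 * nparts; tries++)
    {
        int     cand = rand() % nparts;

        if (cand != home && nalloc[cand] * nparts <= total)
            return cand;
    }
    return home;                /* nothing suitable found, fall back */
}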
regards
--
Tomas Vondra
Hi,
On 2025-07-18 22:48:00 +0200, Tomas Vondra wrote:
On 7/18/25 18:46, Andres Freund wrote:
For a read-write pgbench I however saw some strange drops/increases of
throughput. I suspect this might be due to some thinko in the clocksweep
partitioning, but I'll need to take a closer look.
Was that with pinning etc enabled or not?
IIRC it was with everything enabled, except for numa_procs_pin (which
pins backend to NUMA node). I found that to actually harm performance in
some of the tests (even just read-only ones), resulting in uneven usage
of cores and lower throughput.
FWIW, I really doubt that something like numa_procs_pin is viable outside of
very narrow niches until we have a *lot* more infrastructure in place. Like PG
would need to be threaded, we'd need a separation between thread and
connection and an executor that'd allow us to switch from working on one query
to working on another query.
The hardest thing probably is to make the logic for when to check foreign
clock sweeps cheap enough.
One way would be to do it whenever a sweep wraps around, that'd probably
amortize the cost sufficiently, and I don't think it'd be too imprecise, as
we'd have processed that set of buffers in a row without partitioning as
well. But it'd probably be too coarse when determining for how long to use a
foreign sweep instance. But we probably could address that by rechecking the
balance more frequently when using a foreign partition.
What do you mean by "it"?
it := Considering switching back from using a "foreign" clock sweep instance
whenever the sweep wraps around.
What would happen after a sweep wraps around?
The scenario I'm worried about is this:
1) a bunch of backends read buffers on numa node A, using the local clock
sweep instance
2) due to all of that activity, the clock sweep advances much faster than the
clock sweep for numa node B
3) the clock sweep on A wraps around, we discover the imbalance, and all the
backends switch to scanning on numa node B, moving that clock sweep ahead
much more aggressively
4) clock sweep on B wraps around, there's imbalance the other way round now,
so they all switch back to A
Another way would be to have bgwriter manage this. Whenever it detects that
one ring is too far ahead, it could set a "avoid this partition" bit, which
would trigger backends that natively use that partition to switch to foreign
partitions that don't currently have that bit set. I suspect there's a
problem with that approach though, I worry that the amount of time that
bgwriter spends in BgBufferSync() may sometimes be too long, leading to too
much imbalance.I'm afraid having hard "avoid" flags would lead to sudden and unexpected
changes in performance as we enable/disable partitions. I think a good
solution should "smooth it out" somehow, e.g. by not having a true/false
flag, but having some sort of "preference" factor with values between
(0.0, 1.0) which says how much we should use that partition.
Yea, I think that's a fair worry.
I was imagining something like this:
Say we know the number of buffers allocated for each partition (in the
last round), and we (or rather the BgBufferSync) calculate:
coefficient = 1.0 - (nallocated_partition / nallocated)
and then use that to "correct" which partition to allocate buffers from.
Or maybe just watch how far from the "fair share" we were in the last
interval, and gradually increase/decrease the "partition preference"
which would say how often we need to "steal" from other partitions.E.g. we find nallocated_partition is 2x the fair share, i.e.
nallocated_partition / (nallocated / nparts) = 2.0
Then we say 25% of the time look at some other partition, to "cut" the
imbalance in half. And then repeat that in the next cycle, etc.
So a process would look at its "home partition" by default, but it'd
"roll a dice" first and if above the calculated probability it'd pick
some other partition instead (this would need to be done so that it gets
balanced overall).
That does sound reasonable.
If the bgwriter interval is too long, maybe the recalculation could be
triggered regularly after any of the clocksweeps wraps around, or after
some number of allocations, or something like that.
I'm pretty sure the bgwriter might not run often enough, or predictably
enough, for that.
Greetings,
Andres Freund
On Thu, Jul 17, 2025 at 11:15 PM Tomas Vondra <tomas@vondra.me> wrote:
On 7/4/25 20:12, Tomas Vondra wrote:
On 7/4/25 13:05, Jakub Wartak wrote:
...
8. v1-0005 2x + /* if (numa_procs_interleave) */
Ha! it's a TRAP! I've uncommented it because I wanted to try it out
without it (just by setting GUC off) , but "MyProc->sema" is NULL :2025-07-04 12:31:08.103 CEST [28754] LOG: starting PostgreSQL
19devel on x86_64-linux, compiled by gcc-12.2.0, 64-bit
[..]
2025-07-04 12:31:08.109 CEST [28754] LOG: io worker (PID 28755)
was terminated by signal 11: Segmentation fault
2025-07-04 12:31:08.109 CEST [28754] LOG: terminating any other
active server processes
2025-07-04 12:31:08.114 CEST [28754] LOG: shutting down because
"restart_after_crash" is off
2025-07-04 12:31:08.116 CEST [28754] LOG: database system is shut down
[New LWP 28755]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
Core was generated by `postgres: io worker '.
Program terminated with signal SIGSEGV, Segmentation fault.
#0 __new_sem_wait_fast (definitive_result=1, sem=sem@entry=0x0)
at ./nptl/sem_waitcommon.c:136
136 ./nptl/sem_waitcommon.c: No such file or directory.
(gdb) where
#0 __new_sem_wait_fast (definitive_result=1, sem=sem@entry=0x0)
at ./nptl/sem_waitcommon.c:136
#1 __new_sem_trywait (sem=sem@entry=0x0) at ./nptl/sem_wait.c:81
#2 0x00005561918e0cac in PGSemaphoreReset (sema=0x0) at
../src/backend/port/posix_sema.c:302
#3 0x0000556191970553 in InitAuxiliaryProcess () at
../src/backend/storage/lmgr/proc.c:992
#4 0x00005561918e51a2 in AuxiliaryProcessMainCommon () at
../src/backend/postmaster/auxprocess.c:65
#5 0x0000556191940676 in IoWorkerMain (startup_data=<optimized
out>, startup_data_len=<optimized out>) at
../src/backend/storage/aio/method_worker.c:393
#6 0x00005561918e8163 in postmaster_child_launch
(child_type=child_type@entry=B_IO_WORKER, child_slot=20086,
startup_data=startup_data@entry=0x0,
startup_data_len=startup_data_len@entry=0,
client_sock=client_sock@entry=0x0) at
../src/backend/postmaster/launch_backend.c:290
#7 0x00005561918ea09a in StartChildProcess
(type=type@entry=B_IO_WORKER) at
../src/backend/postmaster/postmaster.c:3973
#8 0x00005561918ea308 in maybe_adjust_io_workers () at
../src/backend/postmaster/postmaster.c:4404
[..]
(gdb) print *MyProc->sem
Cannot access memory at address 0x0
Yeah, good catch. I'll look into that next week.
I've been unable to reproduce this issue, but I'm not sure what settings
you actually used for this instance. Can you give me more details how to
reproduce this?
Better late than never, well feel free to partially ignore me, I've
missed that it is a known issue as per the FIXME there, but I would just rip
out that commented out `if(numa_proc_interleave)` from
FastPathLockShmemSize() and PGProcShmemSize() unless you want to save
those memory pages of course (in case of no-NUMA). If you do want to
save those pages I think we have a problem:
For complete picture, steps:
1. patch -p1 < v2-0001-NUMA-interleaving-buffers.patch
2. patch -p1 < v2-0006-NUMA-interleave-PGPROC-entries.patch
BTW the pgbench accidental indent is still there (part of the v2-0001 patch)
14 out of 14 hunks FAILED -- saving rejects to file
src/bin/pgbench/pgbench.c.rej
3. As I'm just applying 0001 and 0006, I've got two simple rejects,
but fixed it (due to not applying missing numa_ freelist patches).
That's intentional on my part, because I wanted to play just with
those two.
4. Then I uncomment those two "if (numa_procs_interleave)" related for
optional memory shm initialization - add_size() and so on (that have
XXX comment above that it is causing bootstrap issues)
5. initdb with numa_procs_interleave=on, huge_pages = on (!), start, it is ok
6. restart with numa_procs_interleave=off, which gets me every bg
worker crashing, e.g.:
(gdb) where
#0 __new_sem_wait_fast (definitive_result=1, sem=sem@entry=0x0) at
./nptl/sem_waitcommon.c:136
#1 __new_sem_trywait (sem=sem@entry=0x0) at ./nptl/sem_wait.c:81
#2 0x0000563e2d6e4d5c in PGSemaphoreReset (sema=0x0) at
../src/backend/port/posix_sema.c:302
#3 0x0000563e2d774d93 in InitAuxiliaryProcess () at
../src/backend/storage/lmgr/proc.c:995
#4 0x0000563e2d6e9252 in AuxiliaryProcessMainCommon () at
../src/backend/postmaster/auxprocess.c:65
#5 0x0000563e2d6eb683 in CheckpointerMain (startup_data=<optimized
out>, startup_data_len=<optimized out>) at
../src/backend/postmaster/checkpointer.c:190
#6 0x0000563e2d6ec363 in postmaster_child_launch
(child_type=child_type@entry=B_CHECKPOINTER, child_slot=249,
startup_data=startup_data@entry=0x0,
startup_data_len=startup_data_len@entry=0,
client_sock=client_sock@entry=0x0) at
../src/backend/postmaster/launch_backend.c:290
#7 0x0000563e2d6ee29a in StartChildProcess
(type=type@entry=B_CHECKPOINTER) at
../src/backend/postmaster/postmaster.c:3973
#8 0x0000563e2d6f17a6 in PostmasterMain (argc=argc@entry=3,
argv=argv@entry=0x563e377cc0e0) at
../src/backend/postmaster/postmaster.c:1386
#9 0x0000563e2d4948fc in main (argc=3, argv=0x563e377cc0e0) at
../src/backend/main/main.c:231
notice sema=0x0, because:
#3 0x000056050928cd93 in InitAuxiliaryProcess () at
../src/backend/storage/lmgr/proc.c:995
995 PGSemaphoreReset(MyProc->sem);
(gdb) print MyProc
$1 = (PGPROC *) 0x7f09a0c013b0
(gdb) print MyProc->sem
$2 = (PGSemaphore) 0x0
or with printfs:
2025-07-25 11:17:23.683 CEST [21772] LOG: in InitProcGlobal
PGPROC=0x7f9de827b880 requestSize=148770
// after proc && ptr manipulation:
2025-07-25 11:17:23.683 CEST [21772] LOG: in InitProcGlobal
PGPROC=0x7f9de827bdf0 requestSize=148770 procs=0x7f9de827b880
ptr=0x7f9de827bdf0
[..initialization of aux PGPROCs i=0.., still from InitProcGlobal(),
each gets a proper sem allocated as one would expect:]
[..for i loop:]
2025-07-25 11:17:23.689 CEST [21772] LOG: i=136 ,
proc=0x7f9de8600000, proc->sem=0x7f9da4e04438
2025-07-25 11:17:23.689 CEST [21772] LOG: i=137 ,
proc=0x7f9de8600348, proc->sem=0x7f9da4e044b8
2025-07-25 11:17:23.689 CEST [21772] LOG: i=138 ,
proc=0x7f9de8600690, proc->sem=0x7f9da4e04538
[..but then in the children codepaths, out of the blue in
InitAuxiliaryProcess the whole MyProc looks like it was memset to
zeros:]
2025-07-25 11:17:23.693 CEST [21784] LOG: auxiliary process using
MyProc=0x7f9de8600000 auxproc=0x7f9de8600000 proctype=0
MyProcPid=21784 MyProc->sem=(nil)
above got pgproc slot i=136 with addr 0x7f9de8600000 and later that
auxiliary is launched but somehow something NULLified ->sem there
(according to gdb, everything is zero there)
7. Original patch v2-0006 (with commented out 2x if
numa_procs_interleave), behaves OK, so in my case here with 1x NUMA
node that gives add_size(.., 1+1 * 2MB)=4MB
2025-07-25 11:38:54.131 CEST [23939] LOG: in InitProcGlobal
PGPROC=0x7f25cbe7b880 requestSize=4343074
2025-07-25 11:38:54.132 CEST [23939] LOG: in InitProcGlobal
PGPROC=0x7f25cbe7bdf0 requestSize=4343074 procs=0x7f25cbe7b880
ptr=0x7f25cbe7bdf0
so something is zeroing out all those MyProc structures apparently on
startup (probably due to some wrong alignment somewhere, maybe?). I was
thinking about trapping via mprotect() this single i=136
0x7f9de8600000 PGPROC to see what is resetting it, but oh well,
mprotect() works only on whole pages...
-J.
On 7/25/25 12:27, Jakub Wartak wrote:
On Thu, Jul 17, 2025 at 11:15 PM Tomas Vondra <tomas@vondra.me> wrote:
On 7/4/25 20:12, Tomas Vondra wrote:
On 7/4/25 13:05, Jakub Wartak wrote:
...
8. v1-0005 2x + /* if (numa_procs_interleave) */
Ha! it's a TRAP! I've uncommented it because I wanted to try it out
without it (just by setting GUC off) , but "MyProc->sema" is NULL :2025-07-04 12:31:08.103 CEST [28754] LOG: starting PostgreSQL
19devel on x86_64-linux, compiled by gcc-12.2.0, 64-bit
[..]
2025-07-04 12:31:08.109 CEST [28754] LOG: io worker (PID 28755)
was terminated by signal 11: Segmentation fault
2025-07-04 12:31:08.109 CEST [28754] LOG: terminating any other
active server processes
2025-07-04 12:31:08.114 CEST [28754] LOG: shutting down because
"restart_after_crash" is off
2025-07-04 12:31:08.116 CEST [28754] LOG: database system is shut down
[New LWP 28755]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
Core was generated by `postgres: io worker '.
Program terminated with signal SIGSEGV, Segmentation fault.
#0 __new_sem_wait_fast (definitive_result=1, sem=sem@entry=0x0)
at ./nptl/sem_waitcommon.c:136
136 ./nptl/sem_waitcommon.c: No such file or directory.
(gdb) where
#0 __new_sem_wait_fast (definitive_result=1, sem=sem@entry=0x0)
at ./nptl/sem_waitcommon.c:136
#1 __new_sem_trywait (sem=sem@entry=0x0) at ./nptl/sem_wait.c:81
#2 0x00005561918e0cac in PGSemaphoreReset (sema=0x0) at
../src/backend/port/posix_sema.c:302
#3 0x0000556191970553 in InitAuxiliaryProcess () at
../src/backend/storage/lmgr/proc.c:992
#4 0x00005561918e51a2 in AuxiliaryProcessMainCommon () at
../src/backend/postmaster/auxprocess.c:65
#5 0x0000556191940676 in IoWorkerMain (startup_data=<optimized
out>, startup_data_len=<optimized out>) at
../src/backend/storage/aio/method_worker.c:393
#6 0x00005561918e8163 in postmaster_child_launch
(child_type=child_type@entry=B_IO_WORKER, child_slot=20086,
startup_data=startup_data@entry=0x0,
startup_data_len=startup_data_len@entry=0,
client_sock=client_sock@entry=0x0) at
../src/backend/postmaster/launch_backend.c:290
#7 0x00005561918ea09a in StartChildProcess
(type=type@entry=B_IO_WORKER) at
../src/backend/postmaster/postmaster.c:3973
#8 0x00005561918ea308 in maybe_adjust_io_workers () at
../src/backend/postmaster/postmaster.c:4404
[..]
(gdb) print *MyProc->sem
Cannot access memory at address 0x0
Yeah, good catch. I'll look into that next week.
I've been unable to reproduce this issue, but I'm not sure what settings
you actually used for this instance. Can you give me more details how to
reproduce this?
Better late than never, well feel free to partially ignore me, I've
missed that it is a known issue as per the FIXME there, but I would just rip
out that commented out `if(numa_proc_interleave)` from
FastPathLockShmemSize() and PGProcShmemSize() unless you want to save
those memory pages of course (in case of no-NUMA). If you do want to
save those pages I think we have a problem:
For complete picture, steps:
1. patch -p1 < v2-0001-NUMA-interleaving-buffers.patch
2. patch -p1 < v2-0006-NUMA-interleave-PGPROC-entries.patch
BTW the pgbench accidental indent is still there (part of the v2-0001 patch)
14 out of 14 hunks FAILED -- saving rejects to file
src/bin/pgbench/pgbench.c.rej
3. As I'm just applying 0001 and 0006, I've got two simple rejects,
but fixed it (due to not applying missing numa_ freelist patches).
That's intentional on my part, because I wanted to play just with
those two.4. Then I uncomment those two "if (numa_procs_interleave)" related for
optional memory shm initialization - add_size() and so on (that have
XXX comment above that it is causing bootstrap issues)
Ah, I didn't realize you uncommented these "if" conditions. In that case
the crash is not very surprising, because the actual initialization in
InitProcGlobal ignores the GUCs and just assumes it's enabled. But
without the extra padding that likely messes up something. Or something
allocated later "overwrites" the some of the memory.
I need to clean this up, to actually consider the GUC properly.
FWIW I do have a new patch version that I plan to share in a day or two,
once I get some numbers. It didn't change this particular part, though,
it's more about the buffers/freelists/clocksweep. I'll work on PGPROC
next, I think.
regards
--
Tomas Vondra
Hi,
Here's a somewhat cleaned up v3 of this patch series, with various
improvements and a lot of cleanup. Still WIP, but I hope it resolves the
various crashes reported for v2, but it still requires --with-libnuma
(it won't build without it).
I'm aware there's an ongoing discussion about removing the freelists,
and changing the clocksweep in some way. If that happens, the relevant
parts of this series will need some adjustment, of course. I haven't
looked into that yet, I plan to review those patches soon.
main changes in v3
------------------
1) I've introduced "registry" of the buffer partitions (imagine a small
array of structs), serving as a source of truth for places that need
info about the partitions (range of buffers, ...).
With v2 there was no "shared definition" - the shared buffers, freelist
and clocksweep did their own thing. But per the discussion it doesn't
really make much sense to partition buffers in different ways.
So in v3 the 0001 patch defines the partitions, records them in shared
memory (in a small array), and the later parts just reuse this.
I also added a pg_buffercache_partitions() listing the partitions, with
first/last buffer, etc. The freelist/clocksweep patches add additional
information.
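Just to show how later patches consume the registry, here's a minimal
sketch against the BufferPartitionCount() / BufferPartitionGet() API from
0001 (the function name and the loop body are just placeholders, not
actual code from the series):

    static void
    init_per_partition_state(void)   /* hypothetical caller */
    {
        int     nparts = BufferPartitionCount();

        for (int i = 0; i < nparts; i++)
        {
            int     node,
                    num_buffers,
                    first_buffer,
                    last_buffer;

            BufferPartitionGet(i, &node, &num_buffers,
                               &first_buffer, &last_buffer);

            /*
             * Build per-partition state (freelist, clocksweep, ...) for
             * buffers [first_buffer, last_buffer] on NUMA node "node".
             */
        }
    }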
2) The PGPROC part introduces a similar registry, even though there are
no other patches building on this. But it seemed useful to have a clear
place recording this info.
There's also a view pg_buffercache_pgproc. The pg_buffercache location
is a bit bogus - it has nothing to do with buffers, but it was good
enough for now.
3) The PGPROC partitioning is reworked and should fix the crash with the
GUC set to "off".
4) This still doesn't do anything about "balancing" the clocksweep. I
have some ideas how to do that, I'll work on that next.
simple benchmark
----------------
I did a simple benchmark, measuring pgbench throughput with scale still
fitting into RAM, but much larger (~2x) than shared buffers. See the
attached test script, testing builds with more and more of the patches.
I'm attaching results from two different machines (the "usual" 2P xeon
and also a much larger cloud instance with EPYC/Genoa) - both the raw
CSV files, with average tps and percentiles, and PDFs. The PDFs also
have a comparison either to the "preceding" build (right side), or to
master (below the table).
There are results for the three "pgbench pinning" strategies, and that can
have pretty significant impact (colocated generally performs much better
than either "none" or "random").
For the "bigger" machine (wiuth 176 cores) the incremental results look
like this (for pinning=none, i.e. regular pgbench):
mode s_b buffers localal no-tail freelist sweep pgproc pinning
====================================================================
prepared 16GB 99% 101% 100% 103% 111% 99% 102%
32GB 98% 102% 99% 103% 107% 101% 112%
8GB 97% 102% 100% 102% 101% 101% 106%
--------------------------------------------------------------------
simple 16GB 100% 100% 99% 105% 108% 99% 108%
32GB 98% 101% 100% 103% 100% 101% 97%
8GB 100% 100% 101% 99% 100% 104% 104%
The way I read this is that the first three patches have about no impact
on throughput. Then freelist partitioning and (especially) clocksweep
partitioning can help quite a bit. pgproc is again close to ~0%, and
PGPROC pinning can help again (but this part is merely experimental).
For the xeon the differences (in either direction) are much smaller, so
I'm not going to post it here. It's in the PDF, though.
I think this looks reasonable. The way I see this patch series is not
about improving peak throughput, but more about reducing imbalance and
making the behavior more consistent.
The results are more a confirmation that there's not some sort of massive
overhead somewhere. But I'll get to this in a minute.
To quantify this kind of improvement, I think we'll need tests that
intentionally cause (or try to) imbalance. If you have ideas for such
tests, let me know.
overhead of partitioning calculation
------------------------------------
Regarding the "overhead", while the results look mostly OK, I think
we'll need to rethink the partitioning scheme - particularly how the
partition size is calculated. The current scheme has to use %, which can
be somewhat expensive.
The 0001 patch calculates a "chunk size", which is the smallest number
of buffers it can "assign" to a NUMA node. This depends on how many
buffer descriptors fit onto a single memory page, and it's either 512KB
(with 4KB pages), or 256MB (with 2MB huge pages). And then each NUMA
node gets multiple chunks, to cover shared_buffers/num_nodes. But this
can be an arbitrary number - it minimizes the imbalance, but it also
forces the use of % and / in the formulas.
AFAIK if we required the partitions to be 2^k multiples of the chunk
size, we could switch to using shifts and masking. Which is supposed to
be much faster. But I haven't measured this, and the cost is that some
of the nodes could get much less memory. Maybe that's fine.
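To make the difference concrete, here's a hypothetical sketch (node_shift
is an assumed precomputed value, not something in the current patch) of
what the lookup could look like if each node's buffer count were forced
to a power of two, compared to the division in BufferGetNode():

    /*
     * Hypothetical sketch, not part of the patch: assumes
     * numa_buffers_per_node == (1 << node_shift).
     */
    static inline int
    BufferGetNodeByShift(int buffer, int node_shift)
    {
        /* instead of buffer / numa_buffers_per_node */
        return buffer >> node_shift;
    }

    static inline int
    BufferGetNodeOffset(int buffer, int node_shift)
    {
        /* position of the buffer within its node */
        return buffer & ((1 << node_shift) - 1);
    }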
reserving number of huge pages
------------------------------
The other thing I realized is that partitioning buffers with huge pages
is quite tricky, and can easily lead to SIGBUS when accessing the memory
later. The crashes I saw happen like this:
1) figure # of pages needed (using shared_memory_size_in_huge_pages)
This can be 16828 for shared_buffers=32GB.
2) make sure there's enough huge pages
echo 16828 > /proc/sys/vm/nr_hugepages
3) start postgres - everything seems to work just fine
4) query pg_buffercache_numa - triggers SIGBUS accessing memory for a
valid buffer (usually ~2GB from the end)
It took me ages to realize what's happening, but it's very simple. The
nr_hugepages is a global limit, but it's also translated into limits for
each NUMA node. So when you write 16828 to it, in a 4-node system each
node gets 1/4 of that. See
$ numastat -cm
Then we do the mmap(), and everything looks great, because there really
are enough huge pages and the system can allocate memory from any NUMA
node it needs.
And then we come around, and do the numa_tonode_memory(). And that's
where the issues start, because AFAIK this does not check the per-node
limit of huge pages in any way. It just appears to work. And then later,
when we finally touch the buffer, it tries to actually allocate the
memory on the node, and realizes there's not enough huge pages. And
triggers the SIGBUS.
You may ask why the per-node limit is too low. We still need just
shared_memory_size_in_huge_pages, right? And if we were partitioning the
whole memory segment, that'd be true. But we only do that for shared
buffers, and there's a lot of other shared memory - could be 1-2GB or
so, depending on the configuration.
And this gets placed on one of the nodes, and it counts against the
limit on that particular node. And so it doesn't have enough huge pages
to back the partition of shared buffers.
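To put rough numbers on the 4-node example: writing 16828 to nr_hugepages
with 2MB pages gives each node ~4207 huge pages (~8.2GB), while the shared
buffers partition alone needs 8GB (4096 pages) per node. So the node that
also hosts the other 1-2GB of shared memory can't satisfy its share, and
the first touch of a buffer backed by the missing pages triggers the
SIGBUS.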
The only way around this I found is by inflating the number of huge
pages, significantly above the shared_memory_size_in_huge_pages value.
Just to make sure the nodes get enough huge pages.
I don't know what to do about this. It's quite annoying. If we only used
huge pages for the partitioned parts, this wouldn't be a problem.
I also realize this can be used to make sure the memory is balanced on
NUMA systems. Because if you set nr_hugepages, the kernel will ensure
the shared memory is distributed on all the nodes.
It won't have the benefits of "coordinating" the buffers and buffer
descriptors, and so on. But it will be balanced.
regards
--
Tomas Vondra
Attachments:
v3-0001-NUMA-interleaving-buffers.patch (text/x-patch)
From f0eb1af6fdcfd7daae26952ddc223952333f6af2 Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tomas@vondra.me>
Date: Mon, 28 Jul 2025 14:01:37 +0200
Subject: [PATCH v3 1/7] NUMA: interleaving buffers
Ensure shared buffers are allocated from all NUMA nodes, in a balanced
way, instead of just using the node where Postgres initially starts, or
where the kernel decides to migrate the page, etc. With pre-warming
performed by a single backend, this can easily result in severely
unbalanced memory distribution (with most from a single NUMA node).
The kernel would eventually move some of the memory to other nodes
(thanks to zone_reclaim), but that tends to take a long time. So this
patch improves predictability, reduces the time needed for warmup
during benchmarking, etc. It's less dependent on what the CPU
scheduler does, etc.
Furthermore, the buffers are mapped to NUMA nodes in a deterministic
way, so this also allows further improvements like backends using
buffers from the same NUMA node.
The effect is similar to
numactl --interleave=all
but there's a number of important differences.
Firstly, it's applied only to shared buffers (and also to descriptors),
not to the whole shared memory segment. It's not clear we'd want to use
interleaving for all parts, storing entries with different sizes and
life cycles (e.g. ProcArray may need different approach).
Secondly, it considers the page and block size, and makes sure to always
put the whole buffer on a single NUMA node (even if it happens to use
multiple memory pages), and to keep the buffer and its descriptor on
the same NUMA node. The seriousness/likelihood of these issues depends
on the memory page size (regular vs. huge pages).
The mapping of memory to NUMA nodes happens in larger chunks. This is
required to handle buffer descriptors (which are smaller than buffers),
and so many more fit onto a single memory page.
The number of buffer descriptors per memory page determines the smallest
number of buffers that can be placed on a NUMA node. With 2MB huge pages
this is 256MB, with 4KB pages it's 512KB. Nodes get a multiple of
this, and we try to keep the nodes balanced - the last node can get less
memory, though.
The "buffer partitions" may not be 1:1 with NUMA nodes. There's a
minimal number of partitions (default: 4) that will be created even with
fewer NUMA nodes, or no NUMA at all. Each node gets the same number of
partitions, to keep things simple. For example, with 2 nodes there'll be
4 partitions, with each node getting 2 of them. With 3 nodes there'll be
6 partitions (again, 2 per node).
The patch introduces a simple "registry" of buffer partitions, keeping
track of the first/last buffer, NUMA node, etc. This serves as a source
of truth, both for this patch and for later patches building on this
same buffer partition structure.
With the feature disabled (GUC set to 'off'), there'll be a single
partition for all the buffers (and it won't be mapped to a NUMA node).
Notes:
* The feature is enabled by numa_buffers_interleave GUC (default: false)
* It's not clear we want to enable interleaving for all shared memory.
We probably want that for shared buffers, but maybe not for ProcArray
or freelists.
* Similar questions are about huge pages - in general it's a good idea,
but maybe it's not quite good for ProcArray. It's somewhat separate
from NUMA, but not entirely because NUMA works on page granularity.
PGPROC entries are ~8KB, so too large for interleaving with 4K pages,
as we don't want to split the entry to multiple nodes. But could be
done explicitly, by specifying which node to use for the pages.
* We could partition ProcArray, with one partition per NUMA node, and
then at connection time pick a node from the same node. The process
could migrate to some other node later, especially for long-lived
connections, but there's no perfect solution. Maybe we could set
affinity to cores from the same node, or something like that?
---
contrib/pg_buffercache/Makefile | 2 +-
.../pg_buffercache--1.6--1.7.sql | 22 +
contrib/pg_buffercache/pg_buffercache.control | 2 +-
contrib/pg_buffercache/pg_buffercache_pages.c | 92 +++
src/backend/storage/buffer/buf_init.c | 626 +++++++++++++++++-
src/backend/utils/init/globals.c | 3 +
src/backend/utils/misc/guc_tables.c | 10 +
src/include/miscadmin.h | 2 +
src/include/storage/buf_internals.h | 6 +
src/include/storage/bufmgr.h | 15 +
src/tools/pgindent/typedefs.list | 2 +
11 files changed, 771 insertions(+), 11 deletions(-)
create mode 100644 contrib/pg_buffercache/pg_buffercache--1.6--1.7.sql
diff --git a/contrib/pg_buffercache/Makefile b/contrib/pg_buffercache/Makefile
index 5f748543e2e..0e618f66aec 100644
--- a/contrib/pg_buffercache/Makefile
+++ b/contrib/pg_buffercache/Makefile
@@ -9,7 +9,7 @@ EXTENSION = pg_buffercache
DATA = pg_buffercache--1.2.sql pg_buffercache--1.2--1.3.sql \
pg_buffercache--1.1--1.2.sql pg_buffercache--1.0--1.1.sql \
pg_buffercache--1.3--1.4.sql pg_buffercache--1.4--1.5.sql \
- pg_buffercache--1.5--1.6.sql
+ pg_buffercache--1.5--1.6.sql pg_buffercache--1.6--1.7.sql
PGFILEDESC = "pg_buffercache - monitoring of shared buffer cache in real-time"
REGRESS = pg_buffercache pg_buffercache_numa
diff --git a/contrib/pg_buffercache/pg_buffercache--1.6--1.7.sql b/contrib/pg_buffercache/pg_buffercache--1.6--1.7.sql
new file mode 100644
index 00000000000..bd97246f6ab
--- /dev/null
+++ b/contrib/pg_buffercache/pg_buffercache--1.6--1.7.sql
@@ -0,0 +1,22 @@
+/* contrib/pg_buffercache/pg_buffercache--1.6--1.7.sql */
+
+-- complain if script is sourced in psql, rather than via CREATE EXTENSION
+\echo Use "ALTER EXTENSION pg_buffercache UPDATE TO '1.7'" to load this file. \quit
+
+-- Register the new functions.
+CREATE OR REPLACE FUNCTION pg_buffercache_partitions()
+RETURNS SETOF RECORD
+AS 'MODULE_PATHNAME', 'pg_buffercache_partitions'
+LANGUAGE C PARALLEL SAFE;
+
+-- Create a view for convenient access.
+CREATE VIEW pg_buffercache_partitions AS
+ SELECT P.* FROM pg_buffercache_partitions() AS P
+ (partition integer, numa_node integer, num_buffers integer, first_buffer integer, last_buffer integer);
+
+-- Don't want these to be available to public.
+REVOKE ALL ON FUNCTION pg_buffercache_partitions() FROM PUBLIC;
+REVOKE ALL ON pg_buffercache_partitions FROM PUBLIC;
+
+GRANT EXECUTE ON FUNCTION pg_buffercache_partitions() TO pg_monitor;
+GRANT SELECT ON pg_buffercache_partitions TO pg_monitor;
diff --git a/contrib/pg_buffercache/pg_buffercache.control b/contrib/pg_buffercache/pg_buffercache.control
index b030ba3a6fa..11499550945 100644
--- a/contrib/pg_buffercache/pg_buffercache.control
+++ b/contrib/pg_buffercache/pg_buffercache.control
@@ -1,5 +1,5 @@
# pg_buffercache extension
comment = 'examine the shared buffer cache'
-default_version = '1.6'
+default_version = '1.7'
module_pathname = '$libdir/pg_buffercache'
relocatable = true
diff --git a/contrib/pg_buffercache/pg_buffercache_pages.c b/contrib/pg_buffercache/pg_buffercache_pages.c
index ae0291e6e96..8baa7c7b543 100644
--- a/contrib/pg_buffercache/pg_buffercache_pages.c
+++ b/contrib/pg_buffercache/pg_buffercache_pages.c
@@ -27,6 +27,7 @@
#define NUM_BUFFERCACHE_EVICT_ALL_ELEM 3
#define NUM_BUFFERCACHE_NUMA_ELEM 3
+#define NUM_BUFFERCACHE_PARTITIONS_ELEM 5
PG_MODULE_MAGIC_EXT(
.name = "pg_buffercache",
@@ -100,6 +101,7 @@ PG_FUNCTION_INFO_V1(pg_buffercache_usage_counts);
PG_FUNCTION_INFO_V1(pg_buffercache_evict);
PG_FUNCTION_INFO_V1(pg_buffercache_evict_relation);
PG_FUNCTION_INFO_V1(pg_buffercache_evict_all);
+PG_FUNCTION_INFO_V1(pg_buffercache_partitions);
/* Only need to touch memory once per backend process lifetime */
@@ -771,3 +773,93 @@ pg_buffercache_evict_all(PG_FUNCTION_ARGS)
PG_RETURN_DATUM(result);
}
+
+/*
+ * Inquire about partitioning of buffers between NUMA nodes.
+ */
+Datum
+pg_buffercache_partitions(PG_FUNCTION_ARGS)
+{
+ FuncCallContext *funcctx;
+ MemoryContext oldcontext;
+ TupleDesc tupledesc;
+ TupleDesc expected_tupledesc;
+ HeapTuple tuple;
+ Datum result;
+
+ if (SRF_IS_FIRSTCALL())
+ {
+ funcctx = SRF_FIRSTCALL_INIT();
+
+ /* Switch context when allocating stuff to be used in later calls */
+ oldcontext = MemoryContextSwitchTo(funcctx->multi_call_memory_ctx);
+
+ if (get_call_result_type(fcinfo, NULL, &expected_tupledesc) != TYPEFUNC_COMPOSITE)
+ elog(ERROR, "return type must be a row type");
+
+ if (expected_tupledesc->natts != NUM_BUFFERCACHE_PARTITIONS_ELEM)
+ elog(ERROR, "incorrect number of output arguments");
+
+ /* Construct a tuple descriptor for the result rows. */
+ tupledesc = CreateTemplateTupleDesc(expected_tupledesc->natts);
+ TupleDescInitEntry(tupledesc, (AttrNumber) 1, "partition",
+ INT4OID, -1, 0);
+ TupleDescInitEntry(tupledesc, (AttrNumber) 2, "numa_node",
+ INT4OID, -1, 0);
+ TupleDescInitEntry(tupledesc, (AttrNumber) 3, "num_buffers",
+ INT4OID, -1, 0);
+ TupleDescInitEntry(tupledesc, (AttrNumber) 4, "first_buffer",
+ INT4OID, -1, 0);
+ TupleDescInitEntry(tupledesc, (AttrNumber) 5, "last_buffer",
+ INT4OID, -1, 0);
+
+ funcctx->user_fctx = BlessTupleDesc(tupledesc);
+
+ /* Return to original context when allocating transient memory */
+ MemoryContextSwitchTo(oldcontext);
+
+ /* Set max calls and remember the user function context. */
+ funcctx->max_calls = BufferPartitionCount();
+ }
+
+ funcctx = SRF_PERCALL_SETUP();
+
+ if (funcctx->call_cntr < funcctx->max_calls)
+ {
+ uint32 i = funcctx->call_cntr;
+
+ int numa_node,
+ num_buffers,
+ first_buffer,
+ last_buffer;
+
+ Datum values[NUM_BUFFERCACHE_PARTITIONS_ELEM];
+ bool nulls[NUM_BUFFERCACHE_PARTITIONS_ELEM];
+
+ BufferPartitionGet(i, &numa_node, &num_buffers,
+ &first_buffer, &last_buffer);
+
+ values[0] = Int32GetDatum(i);
+ nulls[0] = false;
+
+ values[1] = Int32GetDatum(numa_node);
+ nulls[1] = false;
+
+ values[2] = Int32GetDatum(num_buffers);
+ nulls[2] = false;
+
+ values[3] = Int32GetDatum(first_buffer);
+ nulls[3] = false;
+
+ values[4] = Int32GetDatum(last_buffer);
+ nulls[4] = false;
+
+ /* Build and return the tuple. */
+ tuple = heap_form_tuple((TupleDesc) funcctx->user_fctx, values, nulls);
+ result = HeapTupleGetDatum(tuple);
+
+ SRF_RETURN_NEXT(funcctx, result);
+ }
+ else
+ SRF_RETURN_DONE(funcctx);
+}
diff --git a/src/backend/storage/buffer/buf_init.c b/src/backend/storage/buffer/buf_init.c
index ed1dc488a42..5b65a855b29 100644
--- a/src/backend/storage/buffer/buf_init.c
+++ b/src/backend/storage/buffer/buf_init.c
@@ -14,9 +14,17 @@
*/
#include "postgres.h"
+#ifdef USE_LIBNUMA
+#include <numa.h>
+#include <numaif.h>
+#endif
+
+#include "port/pg_numa.h"
#include "storage/aio.h"
#include "storage/buf_internals.h"
#include "storage/bufmgr.h"
+#include "storage/pg_shmem.h"
+#include "storage/proc.h"
BufferDescPadded *BufferDescriptors;
char *BufferBlocks;
@@ -24,6 +32,19 @@ ConditionVariableMinimallyPadded *BufferIOCVArray;
WritebackContext BackendWritebackContext;
CkptSortItem *CkptBufferIds;
+BufferPartitions *BufferPartitionsArray;
+
+static Size get_memory_page_size(void);
+static void buffer_partitions_prepare(void);
+static void buffer_partitions_init(void);
+
+/* number of NUMA nodes (as returned by numa_num_configured_nodes) */
+static int numa_nodes = -1; /* number of nodes when sizing */
+static Size numa_page_size = 0; /* page used to size partitions */
+static bool numa_can_partition = false; /* can map to NUMA nodes? */
+static int numa_buffers_per_node = -1; /* buffers per node */
+static int numa_partitions = 0; /* total (multiple of nodes) */
+
/*
* Data Structures:
@@ -70,19 +91,89 @@ BufferManagerShmemInit(void)
bool foundBufs,
foundDescs,
foundIOCV,
- foundBufCkpt;
+ foundBufCkpt,
+ foundParts;
+ Size mem_page_size;
+ Size buffer_align;
+
+ /*
+ * XXX A bit weird. Do we need to worry about postmaster? Could this even
+ * run outside postmaster? I don't think so.
+ *
+ * XXX Another issue is we may get different values than when sizing the
+ * memory, because at that point we didn't know if we get huge pages,
+ * so we assumed we will. Shouldn't cause crashes, but we might allocate
+ * shared memory and then not use some of it (because of the alignment
+ * that we don't actually need). Not sure about better way, good for now.
+ */
+ if (IsUnderPostmaster)
+ mem_page_size = pg_get_shmem_pagesize();
+ else
+ mem_page_size = get_memory_page_size();
+
+ /*
+ * With NUMA we need to ensure the buffers are properly aligned not just
+ * to PG_IO_ALIGN_SIZE, but also to memory page size, because NUMA works
+ * on page granularity, and we don't want a buffer to get split to
+ * multiple nodes (when using multiple memory pages).
+ *
+ * We also don't want to interfere with other parts of shared memory,
+ * which could easily happen with huge pages (e.g. with data stored before
+ * buffers).
+ *
+ * We do this by aligning to the larger of the two values (we know both
+ * are power-of-two values, so the larger value is automatically a
+ * multiple of the lesser one).
+ *
+ * XXX Maybe there's a way to use less alignment?
+ *
+ * XXX Maybe with (mem_page_size > PG_IO_ALIGN_SIZE), we don't need to
+ * align to mem_page_size? Especially for very large huge pages (e.g. 1GB)
+ * that doesn't seem quite worth it. Maybe we should simply align to
+ * BLCKSZ, so that buffers don't get split? Still, we might interfere with
+ * other stuff stored in shared memory that we want to allocate on a
+ * particular NUMA node (e.g. ProcArray).
+ *
+ * XXX Maybe with "too large" huge pages we should just not do this, or
+ * maybe do this only for sufficiently large areas (e.g. shared buffers,
+ * but not ProcArray).
+ */
+ buffer_align = Max(mem_page_size, PG_IO_ALIGN_SIZE);
+
+ /* one page is a multiple of the other */
+ Assert(((mem_page_size % PG_IO_ALIGN_SIZE) == 0) ||
+ ((PG_IO_ALIGN_SIZE % mem_page_size) == 0));
+
+ /* allocate the partition registry first */
+ BufferPartitionsArray = (BufferPartitions *)
+ ShmemInitStruct("Buffer Partitions",
+ offsetof(BufferPartitions, partitions) +
+ mul_size(sizeof(BufferPartition), numa_partitions),
+ &foundParts);
- /* Align descriptors to a cacheline boundary. */
+ /*
+ * Align descriptors to a cacheline boundary, and memory page.
+ *
+ * We want to distribute both to NUMA nodes, so that each buffer and its
+ * descriptor are on the same NUMA node. So we align both the same way.
+ *
+ * XXX The memory page is always larger than cacheline, so the cacheline
+ * reference is a bit unnecessary.
+ *
+ * XXX In principle we only need to do this with NUMA, otherwise we could
+ * still align just to cacheline, as before.
+ */
BufferDescriptors = (BufferDescPadded *)
- ShmemInitStruct("Buffer Descriptors",
- NBuffers * sizeof(BufferDescPadded),
- &foundDescs);
+ TYPEALIGN(buffer_align,
+ ShmemInitStruct("Buffer Descriptors",
+ NBuffers * sizeof(BufferDescPadded) + buffer_align,
+ &foundDescs));
/* Align buffer pool on IO page size boundary. */
BufferBlocks = (char *)
- TYPEALIGN(PG_IO_ALIGN_SIZE,
+ TYPEALIGN(buffer_align,
ShmemInitStruct("Buffer Blocks",
- NBuffers * (Size) BLCKSZ + PG_IO_ALIGN_SIZE,
+ NBuffers * (Size) BLCKSZ + buffer_align,
&foundBufs));
/* Align condition variables to cacheline boundary. */
@@ -112,6 +203,12 @@ BufferManagerShmemInit(void)
{
int i;
+ /*
+ * Initialize the registry of buffer partitions, and also move the
+ * memory to different NUMA nodes (if enabled by GUC)
+ */
+ buffer_partitions_init();
+
/*
* Initialize all the buffer headers.
*/
@@ -144,6 +241,11 @@ BufferManagerShmemInit(void)
GetBufferDescriptor(NBuffers - 1)->freeNext = FREENEXT_END_OF_LIST;
}
+ /*
+ * At this point we have all the buffers in a single long freelist. With
+ * freelist partitioning we rebuild them in StrategyInitialize.
+ */
+
/* Init other shared buffer-management stuff */
StrategyInitialize(!foundDescs);
@@ -152,24 +254,68 @@ BufferManagerShmemInit(void)
&backend_flush_after);
}
+/*
+ * Determine the size of memory page.
+ *
+ * XXX This is a bit tricky, because the result depends at which point we call
+ * this. Before the allocation we don't know if we succeed in allocating huge
+ * pages - but we have to size everything for the chance that we will. And then
+ * if the huge pages fail (with 'huge_pages=try'), we'll use the regular memory
+ * pages. But at that point we can't adjust the sizing.
+ *
+ * XXX Maybe with huge_pages=try we should do the sizing twice - first with
+ * huge pages, and if that fails, then without them. But not for this patch.
+ * Up to this point there was no such dependency on huge pages.
+ */
+static Size
+get_memory_page_size(void)
+{
+ Size os_page_size;
+ Size huge_page_size;
+
+#ifdef WIN32
+ SYSTEM_INFO sysinfo;
+
+ GetSystemInfo(&sysinfo);
+ os_page_size = sysinfo.dwPageSize;
+#else
+ os_page_size = sysconf(_SC_PAGESIZE);
+#endif
+
+ /* assume huge pages get used, unless HUGE_PAGES_OFF */
+ if (huge_pages_status != HUGE_PAGES_OFF)
+ GetHugePageSize(&huge_page_size, NULL);
+ else
+ huge_page_size = 0;
+
+ return Max(os_page_size, huge_page_size);
+}
+
/*
* BufferManagerShmemSize
*
* compute the size of shared memory for the buffer pool including
* data pages, buffer descriptors, hash tables, etc.
+ *
+ * XXX Called before allocation, so we don't know if huge pages get used yet.
+ * So we need to assume huge pages get used, and use get_memory_page_size()
+ * to calculate the largest possible memory page.
*/
Size
BufferManagerShmemSize(void)
{
Size size = 0;
+ /* calculate partition info for buffers */
+ buffer_partitions_prepare();
+
/* size of buffer descriptors */
size = add_size(size, mul_size(NBuffers, sizeof(BufferDescPadded)));
/* to allow aligning buffer descriptors */
- size = add_size(size, PG_CACHE_LINE_SIZE);
+ size = add_size(size, Max(numa_page_size, PG_IO_ALIGN_SIZE));
/* size of data pages, plus alignment padding */
- size = add_size(size, PG_IO_ALIGN_SIZE);
+ size = add_size(size, Max(numa_page_size, PG_IO_ALIGN_SIZE));
size = add_size(size, mul_size(NBuffers, BLCKSZ));
/* size of stuff controlled by freelist.c */
@@ -184,5 +330,467 @@ BufferManagerShmemSize(void)
/* size of checkpoint sort array in bufmgr.c */
size = add_size(size, mul_size(NBuffers, sizeof(CkptSortItem)));
+ /* account for registry of NUMA partitions */
+ size = add_size(size, MAXALIGN(offsetof(BufferPartitions, partitions) +
+ mul_size(sizeof(BufferPartition), numa_partitions)));
+
return size;
}
+
+/*
+ * Calculate the NUMA node for a given buffer.
+ */
+int
+BufferGetNode(Buffer buffer)
+{
+ /* not NUMA interleaving */
+ if (numa_buffers_per_node == -1)
+ return 0;
+
+ return (buffer / numa_buffers_per_node);
+}
+
+/*
+ * pg_numa_move_to_node
+ * move a contiguous chunk of memory to a single NUMA node
+ *
+ * startptr - start of the region (should be aligned to page size)
+ * endptr - end of the region (doesn't need to be aligned)
+ * node - NUMA node to move the region to
+ *
+ * XXX Maybe this should also use numa_police_memory? numa_tonode_memory
+ * works on larger chunks of memory, not individual system pages, so it
+ * should be more efficient than numa_move_pages.
+ */
+static void
+pg_numa_move_to_node(char *startptr, char *endptr, int node)
+{
+ Size mem_page_size;
+ Size sz;
+
+ /*
+ * Get the "actual" memory page size, not the one we used for sizing. We
+ * might have used huge page for sizing, but only get regular pages when
+ * allocating, so we must use the smaller pages here.
+ *
+ * XXX A bit weird. Do we need to worry about postmaster? Could this even
+ * run outside postmaster? I don't think so.
+ */
+ if (IsUnderPostmaster)
+ mem_page_size = pg_get_shmem_pagesize();
+ else
+ mem_page_size = get_memory_page_size();
+
+ Assert((int64) startptr % mem_page_size == 0);
+
+ sz = (endptr - startptr);
+ numa_tonode_memory(startptr, sz, node);
+}
+
+
+#define MIN_BUFFER_PARTITIONS 4
+
+/*
+ * buffer_partitions_prepare
+ * Calculate parameters for partitioning buffers.
+ *
+ * We want to split the shared buffers into multiple partitions, of roughly
+ * the same size. This is meant to serve multiple purposes. We want to map
+ * the partitions to different NUMA nodes, to balance memory usage, and
+ * allow partitioning some data structures built on top of buffers, to give
+ * preference to local access (buffers on the same NUMA node). This applies
+ * mostly to freelists and clocksweep.
+ *
+ * We may want to use partitioning even on non-NUMA systems, or when running
+ * on a single NUMA node. Partitioning the freelist/clocksweep is beneficial
+ * even without the NUMA effects.
+ *
+ * So we try to always build at least 4 partitions (MIN_BUFFER_PARTITIONS)
+ * in total, or at least one partition per NUMA node. We always create the
+ * same number of partitions per NUMA node.
+ *
+ * Some examples:
+ *
+ * - non-NUMA system (or 1 NUMA node): 4 partitions for the single node
+ *
+ * - 2 NUMA nodes: 4 partitions, 2 for each node
+ *
+ * - 3 NUMA nodes: 6 partitions, 2 for each node
+ *
+ * - 4+ NUMA nodes: one partition per node
+ *
+ * NUMA works on the memory-page granularity, which determines the smallest
+ * amount of memory we can allocate to single node. This is determined by
+ * how many BufferDescriptors fit onto a single memory page, so this depends
+ * on huge page support. With 2MB huge pages (typical on x86 Linux), this is
+ * 32768 buffers (256MB). With regular 4kB pages, it's 64 buffers (512KB).
+ *
+ * Note: This is determined before the allocation, i.e. we don't know if the
+ * allocation got to use huge pages. So unless huge_pages=off we assume we're
+ * using huge pages.
+ *
+ * This minimal size requirement only matters for the per-node amount of
+ * memory, not for the individual partitions. The partitions for the same
+ * node are a contiguous chunk of memory, which can be split arbitrarily,
+ * it's independent of the NUMA granularity.
+ *
+ * XXX This patch only implements placing the buffers onto different NUMA
+ * nodes. The freelist/clocksweep partitioning is implemented in separate
+ * patches later in the patch series. Those patches however use the same
+ * buffer partition registry, to align the partitions.
+ *
+ *
+ * XXX This needs to consider the minimum chunk size, i.e. we can't split
+ * buffers beyond some point - at some point we run into the size of
+ * buffer descriptors. Not sure if we should give preference to one of these
+ * (probably at least print a warning).
+ *
+ * XXX We want to do this even with numa_buffers_interleave=false, so that the
+ * other patches can do their partitioning. But in that case we don't need to
+ * enforce the min chunk size (probably)?
+ *
+ * XXX We need to only call this once, when sizing the memory. But at that
+ * point we don't know if we get to use huge pages or not (unless when huge
+ * pages are disabled). We'll proceed as if the huge pages were used, and we
+ * may have to use larger partitions. Maybe there's some sort of fallback,
+ * but for now we simply disable the NUMA partitioning - it simply means the
+ * shared buffers are too small.
+ *
+ * XXX We don't need to make each partition a multiple of min_partition_size.
+ * That's something we need to do for a node (because NUMA works at granularity
+ * of pages), but partitions for a single node can split that arbitrarily.
+ * Although keeping the sizes power-of-two would allow calculating everything
+ * as shift/mask, without expensive division/modulo operations.
+ */
+static void
+buffer_partitions_prepare(void)
+{
+ /*
+ * Minimum number of buffers we can allocate to a NUMA node (determined by
+ * how many BufferDescriptors fit onto a memory page).
+ */
+ int min_node_buffers;
+
+ /*
+ * Maximum number of nodes we can split shared buffers to, assuming each
+ * node gets the smallest allocatable chunk (the last node can get a
+ * smaller amount of memory, not the full chunk).
+ */
+ int max_nodes;
+
+ /*
+ * How many partitions to create per node. Could be more than 1 for small
+ * number of nodes (or non-NUMA systems).
+ */
+ int num_partitions_per_node;
+
+ /* bail out if already initialized (calculate only once) */
+ if (numa_nodes != -1)
+ return;
+
+ /* XXX only gives us the number, the nodes may not be 0, 1, 2, ... */
+ numa_nodes = numa_num_configured_nodes();
+
+ /* XXX can this happen? */
+ if (numa_nodes < 1)
+ numa_nodes = 1;
+
+ elog(WARNING, "IsUnderPostmaster %d", IsUnderPostmaster);
+
+ /*
+ * XXX A bit weird. Do we need to worry about postmaster? Could this even
+ * run outside postmaster? I don't think so.
+ *
+ * XXX Another issue is we may get different values than when sizing the
+ * memory, because at that point we didn't know if we get huge pages,
+ * so we assumed we will. Shouldn't cause crashes, but we might allocate
+ * shared memory and then not use some of it (because of the alignment
+ * that we don't actually need). Not sure about better way, good for now.
+ */
+ if (IsUnderPostmaster)
+ numa_page_size = pg_get_shmem_pagesize();
+ else
+ numa_page_size = get_memory_page_size();
+
+ /* make sure the chunks will align nicely */
+ Assert(BLCKSZ % sizeof(BufferDescPadded) == 0);
+ Assert(numa_page_size % sizeof(BufferDescPadded) == 0);
+ Assert(((BLCKSZ % numa_page_size) == 0) || ((numa_page_size % BLCKSZ) == 0));
+
+ /*
+ * The minimum number of buffers we can allocate from a single node, using
+ * the memory page size (determined by buffer descriptors). NUMA allocates
+ * memory in pages, and we need to do that for both buffers and
+ * descriptors at the same time.
+ *
+ * In practice the BLCKSZ doesn't really matter, because it's much larger
+ * than BufferDescPadded, so the result is determined by the buffer descriptors.
+ */
+ min_node_buffers = (numa_page_size / sizeof(BufferDescPadded));
+
+ /*
+ * Maximum number of nodes (each getting min_node_buffers) we can handle
+ * given the current shared buffers size. The last node is allowed to be
+ * smaller (half of the other nodes).
+ */
+ max_nodes = (NBuffers + (min_node_buffers / 2)) / min_node_buffers;
+
+ /*
+ * Can we actually do NUMA partitioning with these settings? If we can't
+ * handle the current number of nodes, then no.
+ *
+ * XXX This shouldn't be a big issue in practice. NUMA systems typically
+ * run with large shared buffers, which also makes the imbalance issues
+ * fairly significant (it's quick to rebalance 128MB, much slower to do
+ * that for 256GB).
+ */
+ numa_can_partition = true; /* assume we can allocate to nodes */
+ if (numa_nodes > max_nodes)
+ {
+ elog(WARNING, "shared buffers too small for %d nodes (max nodes %d)",
+ numa_nodes, max_nodes);
+ numa_can_partition = false;
+ }
+
+ /*
+ * We know we can partition to the desired number of nodes, now it's time
+ * to figure out how many partitions we need per node. We simply add
+ * partitions per node until we reach MIN_BUFFER_PARTITIONS.
+ *
+ * XXX Maybe we should make sure to keep the actual partition size a power
+ * of 2, to make the calculations simpler (shift instead of mod).
+ */
+ num_partitions_per_node = 1;
+
+ while (numa_nodes * num_partitions_per_node < MIN_BUFFER_PARTITIONS)
+ num_partitions_per_node++;
+
+ /* now we know the total number of partitions */
+ numa_partitions = (numa_nodes * num_partitions_per_node);
+
+ /*
+ * Finally, calculate how many buffers we'll assign to a single NUMA node.
+ * If we have only a single node, or can't map to that many nodes, just
+ * take a "fair share" of buffers.
+ *
+ * XXX In both cases the last node can get fewer buffers.
+ */
+ if (!numa_can_partition)
+ {
+ numa_buffers_per_node = (NBuffers + (numa_nodes - 1)) / numa_nodes;
+ }
+ else
+ {
+ numa_buffers_per_node = min_node_buffers;
+ while (numa_buffers_per_node * numa_nodes < NBuffers)
+ numa_buffers_per_node += min_node_buffers;
+
+ /* the last node should get at least some buffers */
+ Assert(NBuffers - (numa_nodes - 1) * numa_buffers_per_node > 0);
+ }
+
+ elog(LOG, "NUMA: buffers %d partitions %d num_nodes %d per_node %d buffers_per_node %d (min %d)",
+ NBuffers, numa_partitions, numa_nodes, num_partitions_per_node,
+ numa_buffers_per_node, min_node_buffers);
+}
+
+static void
+AssertCheckBufferPartitions(void)
+{
+#ifdef USE_ASSERT_CHECKING
+ int num_buffers = 0;
+
+ for (int i = 0; i < numa_partitions; i++)
+ {
+ BufferPartition *part = &BufferPartitionsArray->partitions[i];
+
+ /*
+ * We can get a single-buffer partition, if the sizing forces the last
+ * partition to be just one buffer. But it's unlikely (and
+ * undesirable).
+ */
+ Assert(part->first_buffer <= part->last_buffer);
+ Assert((part->last_buffer - part->first_buffer + 1) == part->num_buffers);
+
+ num_buffers += part->num_buffers;
+
+ /*
+ * The first partition needs to start on buffer 0. Later partitions
+ * need to be contiguous, without skipping any buffers.
+ */
+ if (i == 0)
+ {
+ Assert(part->first_buffer == 0);
+ }
+ else
+ {
+ BufferPartition *prev = &BufferPartitionsArray->partitions[i - 1];
+
+ Assert((part->first_buffer - 1) == prev->last_buffer);
+ }
+
+ /* the last partition needs to end on buffer (NBuffers - 1) */
+ if (i == (numa_partitions - 1))
+ {
+ Assert(part->last_buffer == (NBuffers - 1));
+ }
+ }
+
+ Assert(num_buffers == NBuffers);
+#endif
+}
+
+static void
+buffer_partitions_init(void)
+{
+ int remaining_buffers = NBuffers;
+ int buffer = 0;
+ int parts_per_node = (numa_partitions / numa_nodes);
+ char *buffers_ptr,
+ *descriptors_ptr;
+
+ BufferPartitionsArray->npartitions = numa_partitions;
+
+ for (int n = 0; n < numa_nodes; n++)
+ {
+ /* buffers this node should get (last node can get fewer) */
+ int node_buffers = Min(remaining_buffers, numa_buffers_per_node);
+
+ /* split node buffers between partitions (last one can get fewer) */
+ int part_buffers = (node_buffers + (parts_per_node - 1)) / parts_per_node;
+
+ remaining_buffers -= node_buffers;
+
+ Assert((node_buffers > 0) && (node_buffers <= NBuffers));
+ Assert((n >= 0) && (n < numa_nodes));
+
+ for (int p = 0; p < parts_per_node; p++)
+ {
+ int idx = (n * parts_per_node) + p;
+ BufferPartition *part = &BufferPartitionsArray->partitions[idx];
+ int num_buffers = Min(node_buffers, part_buffers);
+
+ Assert((idx >= 0) && (idx < numa_partitions));
+ Assert((buffer >= 0) && (buffer < NBuffers));
+ Assert((num_buffers > 0) && (num_buffers <= part_buffers));
+
+ /* XXX we should get the actual node ID from the mask */
+ part->numa_node = n;
+
+ part->num_buffers = num_buffers;
+ part->first_buffer = buffer;
+ part->last_buffer = buffer + (num_buffers - 1);
+
+ elog(LOG, "NUMA: buffer %d node %d partition %d buffers %d first %d last %d", idx, n, p, num_buffers, buffer, buffer + (num_buffers - 1));
+
+ buffer += num_buffers;
+ node_buffers -= part_buffers;
+ }
+ }
+
+ AssertCheckBufferPartitions();
+
+ /*
+ * With buffers interleaving disabled (or can't partition, because of
+ * shared buffers being too small), we're done.
+ */
+ if (!numa_buffers_interleave || !numa_can_partition)
+ return;
+
+ /*
+ * Assign chunks of buffers and buffer descriptors to the available NUMA
+ * nodes. We can't use the regular interleaving, because with regular
+ * memory pages (smaller than BLCKSZ) we'd split all buffers to multiple
+ * NUMA nodes. And we don't want that.
+ *
+ * But even with huge pages it seems like a good idea to not have mapping
+ * for each page.
+ *
+ * So we always assign a larger contiguous chunk of buffers to the same
+ * NUMA node, as calculated by choose_chunk_buffers(). We try to keep the
+ * chunks large enough to work both for buffers and buffer descriptors,
+ * but not too large. See the comments at choose_chunk_buffers() for
+ * details.
+ *
+ * Thanks to the earlier alignment (to memory page etc.), we know the
+ * buffers won't get split, etc.
+ *
+ * This also makes it easier / straightforward to calculate which NUMA
+ * node a buffer belongs to (it's a matter of divide + mod). See
+ * BufferGetNode().
+ *
+ * We need to account for partitions being of different length, when the
+ * NBuffers is not nicely divisible. To do that we keep track of the start
+ * of the next partition.
+ */
+ buffers_ptr = BufferBlocks;
+ descriptors_ptr = (char *) BufferDescriptors;
+
+ for (int i = 0; i < numa_partitions; i++)
+ {
+ BufferPartition *part = &BufferPartitionsArray->partitions[i];
+ char *startptr,
+ *endptr;
+
+ /* first map buffers */
+ startptr = buffers_ptr;
+ endptr = startptr + ((Size) part->num_buffers * BLCKSZ);
+ buffers_ptr = endptr; /* start of the next partition */
+
+ elog(LOG, "NUMA: buffer_partitions_init: %d => %d buffers %d start %p end %p (size %ld)",
+ i, part->numa_node, part->num_buffers, startptr, endptr, (endptr - startptr));
+
+ pg_numa_move_to_node(startptr, endptr, part->numa_node);
+
+ /* now do the same for buffer descriptors */
+ startptr = descriptors_ptr;
+ endptr = startptr + ((Size) part->num_buffers * sizeof(BufferDescPadded));
+ descriptors_ptr = endptr;
+
+ elog(LOG, "NUMA: buffer_partitions_init: %d => %d descriptors %d start %p end %p (size %ld)",
+ i, part->numa_node, part->num_buffers, startptr, endptr, (endptr - startptr));
+
+ pg_numa_move_to_node(startptr, endptr, part->numa_node);
+ }
+
+ /* we should have consumed the arrays exactly */
+ Assert(buffers_ptr == BufferBlocks + (Size) NBuffers * BLCKSZ);
+ Assert(descriptors_ptr == (char *) BufferDescriptors + (Size) NBuffers * sizeof(BufferDescPadded));
+}
+
+int
+BufferPartitionCount(void)
+{
+ return BufferPartitionsArray->npartitions;
+}
+
+void
+BufferPartitionGet(int idx, int *node, int *num_buffers,
+ int *first_buffer, int *last_buffer)
+{
+ if ((idx >= 0) && (idx < BufferPartitionsArray->npartitions))
+ {
+ BufferPartition *part = &BufferPartitionsArray->partitions[idx];
+
+ *node = part->numa_node;
+ *num_buffers = part->num_buffers;
+ *first_buffer = part->first_buffer;
+ *last_buffer = part->last_buffer;
+
+ return;
+ }
+
+ elog(ERROR, "invalid partition index");
+}
+
+void
+BufferPartitionParams(int *num_partitions, int *num_nodes)
+{
+ *num_partitions = numa_partitions;
+ *num_nodes = numa_nodes;
+}
diff --git a/src/backend/utils/init/globals.c b/src/backend/utils/init/globals.c
index d31cb45a058..876cb64cf66 100644
--- a/src/backend/utils/init/globals.c
+++ b/src/backend/utils/init/globals.c
@@ -145,6 +145,9 @@ int max_worker_processes = 8;
int max_parallel_workers = 8;
int MaxBackends = 0;
+/* NUMA stuff */
+bool numa_buffers_interleave = false;
+
/* GUC parameters for vacuum */
int VacuumBufferUsageLimit = 2048;
diff --git a/src/backend/utils/misc/guc_tables.c b/src/backend/utils/misc/guc_tables.c
index d14b1678e7f..9570087aa60 100644
--- a/src/backend/utils/misc/guc_tables.c
+++ b/src/backend/utils/misc/guc_tables.c
@@ -2116,6 +2116,16 @@ struct config_bool ConfigureNamesBool[] =
NULL, NULL, NULL
},
+ {
+ {"numa_buffers_interleave", PGC_POSTMASTER, DEVELOPER_OPTIONS,
+ gettext_noop("Enables NUMA interleaving of shared buffers."),
+ gettext_noop("When enabled, the buffers in shared memory are interleaved to all NUMA nodes."),
+ },
+ &numa_buffers_interleave,
+ false,
+ NULL, NULL, NULL
+ },
+
{
{"sync_replication_slots", PGC_SIGHUP, REPLICATION_STANDBY,
gettext_noop("Enables a physical standby to synchronize logical failover replication slots from the primary server."),
diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h
index 1bef98471c3..014a6079af2 100644
--- a/src/include/miscadmin.h
+++ b/src/include/miscadmin.h
@@ -178,6 +178,8 @@ extern PGDLLIMPORT int MaxConnections;
extern PGDLLIMPORT int max_worker_processes;
extern PGDLLIMPORT int max_parallel_workers;
+extern PGDLLIMPORT bool numa_buffers_interleave;
+
extern PGDLLIMPORT int commit_timestamp_buffers;
extern PGDLLIMPORT int multixact_member_buffers;
extern PGDLLIMPORT int multixact_offset_buffers;
diff --git a/src/include/storage/buf_internals.h b/src/include/storage/buf_internals.h
index 52a71b138f7..9dfbecb9fe4 100644
--- a/src/include/storage/buf_internals.h
+++ b/src/include/storage/buf_internals.h
@@ -323,6 +323,7 @@ typedef struct WritebackContext
/* in buf_init.c */
extern PGDLLIMPORT BufferDescPadded *BufferDescriptors;
+extern PGDLLIMPORT BufferPartitions *BufferPartitionsArray;
extern PGDLLIMPORT ConditionVariableMinimallyPadded *BufferIOCVArray;
extern PGDLLIMPORT WritebackContext BackendWritebackContext;
@@ -491,4 +492,9 @@ extern void DropRelationLocalBuffers(RelFileLocator rlocator,
extern void DropRelationAllLocalBuffers(RelFileLocator rlocator);
extern void AtEOXact_LocalBuffers(bool isCommit);
+extern int BufferPartitionCount(void);
+extern void BufferPartitionGet(int idx, int *node, int *num_buffers,
+ int *first_buffer, int *last_buffer);
+extern void BufferPartitionParams(int *num_partitions, int *num_nodes);
+
#endif /* BUFMGR_INTERNALS_H */
diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index 41fdc1e7693..deaf4f19fa4 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -143,6 +143,20 @@ struct ReadBuffersOperation
typedef struct ReadBuffersOperation ReadBuffersOperation;
+typedef struct BufferPartition
+{
+ int numa_node;
+ int num_buffers;
+ int first_buffer;
+ int last_buffer;
+} BufferPartition;
+
+typedef struct BufferPartitions
+{
+ int npartitions;
+ BufferPartition partitions[FLEXIBLE_ARRAY_MEMBER];
+} BufferPartitions;
+
/* forward declared, to avoid having to expose buf_internals.h here */
struct WritebackContext;
@@ -319,6 +333,7 @@ extern void EvictRelUnpinnedBuffers(Relation rel,
/* in buf_init.c */
extern void BufferManagerShmemInit(void);
extern Size BufferManagerShmemSize(void);
+extern int BufferGetNode(Buffer buffer);
/* in localbuf.c */
extern void AtProcExit_LocalBuffers(void);
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 3daba26b237..c695cfa76e8 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -346,6 +346,8 @@ BufferDescPadded
BufferHeapTupleTableSlot
BufferLookupEnt
BufferManagerRelation
+BufferPartition
+BufferPartitions
BufferStrategyControl
BufferTag
BufferUsage
--
2.50.1
v3-0007-NUMA-pin-backends-to-NUMA-nodes.patch (text/x-patch)
From 3b3a929b007205c46b3c193cf38e3edf71084af7 Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tomas@vondra.me>
Date: Tue, 27 May 2025 23:08:48 +0200
Subject: [PATCH v3 7/7] NUMA: pin backends to NUMA nodes
When initializing the backend, we pick a PGPROC entry from the right
NUMA node where the backend is running. But the process can move to a
different core / node, so to prevent that we pin it.
---
src/backend/storage/lmgr/proc.c | 21 +++++++++++++++++++++
src/backend/utils/init/globals.c | 1 +
src/backend/utils/misc/guc_tables.c | 10 ++++++++++
src/include/miscadmin.h | 1 +
4 files changed, 33 insertions(+)
diff --git a/src/backend/storage/lmgr/proc.c b/src/backend/storage/lmgr/proc.c
index 11259151a7d..dbb4cbb1bfa 100644
--- a/src/backend/storage/lmgr/proc.c
+++ b/src/backend/storage/lmgr/proc.c
@@ -766,6 +766,27 @@ InitProcess(void)
}
MyProcNumber = GetNumberFromPGProc(MyProc);
+ /*
+ * Optionally, restrict the process to only run on CPUs from the same NUMA
+ * as the PGPROC. We do this even if the PGPROC has a different NUMA node,
+ * but not for PGPROC entries without a node (i.e. aux/2PC entries).
+ *
+ * This also means we only do this with numa_procs_interleave, because
+ * without that we'll have numa_node=-1 for all PGPROC entries.
+ *
+ * FIXME add proper error-checking for libnuma functions
+ */
+ if (numa_procs_pin && MyProc->numa_node != -1)
+ {
+ struct bitmask *cpumask = numa_allocate_cpumask();
+
+ numa_node_to_cpus(MyProc->numa_node, cpumask);
+
+ numa_sched_setaffinity(MyProcPid, cpumask);
+
+ numa_free_cpumask(cpumask);
+ }
+
/*
* Cross-check that the PGPROC is of the type we expect; if this were not
* the case, it would get returned to the wrong list.
diff --git a/src/backend/utils/init/globals.c b/src/backend/utils/init/globals.c
index 6ee4684d1b8..3f88659b49f 100644
--- a/src/backend/utils/init/globals.c
+++ b/src/backend/utils/init/globals.c
@@ -150,6 +150,7 @@ bool numa_buffers_interleave = false;
bool numa_localalloc = false;
bool numa_partition_freelist = false;
bool numa_procs_interleave = false;
+bool numa_procs_pin = false;
/* GUC parameters for vacuum */
int VacuumBufferUsageLimit = 2048;
diff --git a/src/backend/utils/misc/guc_tables.c b/src/backend/utils/misc/guc_tables.c
index 7b718760248..862341e137e 100644
--- a/src/backend/utils/misc/guc_tables.c
+++ b/src/backend/utils/misc/guc_tables.c
@@ -2156,6 +2156,16 @@ struct config_bool ConfigureNamesBool[] =
NULL, NULL, NULL
},
+ {
+ {"numa_procs_pin", PGC_POSTMASTER, DEVELOPER_OPTIONS,
+ gettext_noop("Enables pinning backends to NUMA nodes (matching the PGPROC node)."),
+ gettext_noop("When enabled, sets affinity to CPUs from the same NUMA node."),
+ },
+ &numa_procs_pin,
+ false,
+ NULL, NULL, NULL
+ },
+
{
{"sync_replication_slots", PGC_SIGHUP, REPLICATION_STANDBY,
gettext_noop("Enables a physical standby to synchronize logical failover replication slots from the primary server."),
diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h
index cdeee8dccba..a97741c6707 100644
--- a/src/include/miscadmin.h
+++ b/src/include/miscadmin.h
@@ -182,6 +182,7 @@ extern PGDLLIMPORT bool numa_buffers_interleave;
extern PGDLLIMPORT bool numa_localalloc;
extern PGDLLIMPORT bool numa_partition_freelist;
extern PGDLLIMPORT bool numa_procs_interleave;
+extern PGDLLIMPORT bool numa_procs_pin;
extern PGDLLIMPORT int commit_timestamp_buffers;
extern PGDLLIMPORT int multixact_member_buffers;
--
2.50.1
v3-0006-NUMA-interleave-PGPROC-entries.patch (text/x-patch)
From 5addb5973ce571debebf07b17adc07eb828a48ee Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tomas@vondra.me>
Date: Thu, 22 May 2025 18:39:08 +0200
Subject: [PATCH v3 6/7] NUMA: interleave PGPROC entries
The goal is to distribute ProcArray (or rather PGPROC entries and
associated fast-path arrays) to NUMA nodes.
We can't do this by simply interleaving pages, because that wouldn't
work for both parts at the same time. We want to place the PGPROC and
its fast-path locking structs on the same node, but the structs are
of different sizes, etc.
Another problem is that PGPROC entries are fairly small, so with huge
pages and reasonable values of max_connections everything fits onto a
single page. We don't want to make this incompatible with huge pages.
Note: If we eventually switch to allocating separate shared segments for
different parts (to allow on-line resizing), we could keep using regular
pages for procarray, and this would not be such an issue.
To make this work, we split the PGPROC array into per-node segments,
each with about (MaxBackends / numa_nodes) entries, and one segment for
auxiliary processes and prepared transactions. And we do the same thing
for fast-path arrays.
The PGPROC segments are laid out like this (e.g. for 2 NUMA nodes):
- PGPROC array / node #0
- PGPROC array / node #1
- PGPROC array / aux processes + 2PC transactions
- fast-path arrays / node #0
- fast-path arrays / node #1
- fast-path arrays / aux processes + 2PC transaction
Each segment is aligned to (starts at) a memory page boundary, and
effectively spans a whole number of memory pages.
Having a single PGPROC array made certain operations easier - e.g. it
was possible to iterate the array, and GetNumberFromPGProc() could
calculate offset by simply subtracting PGPROC pointers. With multiple
segments that's not possible, but the fallout is minimal.
Most places accessed PGPROC through PROC_HDR->allProcs, and can continue
to do so, except that now they get a pointer to the PGPROC (which most
places wanted anyway).
With the feature disabled, there's only a single "partition" for all
PGPROC entries.
Similarly to the buffer partitioning, this introduces a small "registry"
of partitions, as a source of truth. And then also a new "system" view
"pg_buffercache_pgproc" showing basic infromation abouut the partitions.
Note: There's an indirection, though. But the pointer does not change,
so hopefully that's not an issue. And each PGPROC entry gets an explicit
procnumber field, which is the index in allProcs, GetNumberFromPGProc
can simply return that.
Each PGPROC also gets numa_node, tracking the NUMA node, so that we
don't have to recalculate that. This is used by InitProcess() to pick
a PGPROC entry from the local NUMA node.
Note: The scheduler may migrate the process to a different CPU/node
later. Maybe we should consider pinning the process to the node?
---
.../pg_buffercache--1.6--1.7.sql | 19 +
contrib/pg_buffercache/pg_buffercache_pages.c | 94 +++
src/backend/access/transam/clog.c | 4 +-
src/backend/postmaster/pgarch.c | 2 +-
src/backend/postmaster/walsummarizer.c | 2 +-
src/backend/storage/buffer/buf_init.c | 2 -
src/backend/storage/buffer/freelist.c | 2 +-
src/backend/storage/ipc/procarray.c | 63 +-
src/backend/storage/lmgr/lock.c | 6 +-
src/backend/storage/lmgr/proc.c | 565 +++++++++++++++++-
src/backend/utils/init/globals.c | 1 +
src/backend/utils/misc/guc_tables.c | 10 +
src/include/miscadmin.h | 1 +
src/include/storage/proc.h | 14 +-
src/tools/pgindent/typedefs.list | 1 +
15 files changed, 722 insertions(+), 64 deletions(-)
diff --git a/contrib/pg_buffercache/pg_buffercache--1.6--1.7.sql b/contrib/pg_buffercache/pg_buffercache--1.6--1.7.sql
index b7d8ea45ed7..c48950a9d3b 100644
--- a/contrib/pg_buffercache/pg_buffercache--1.6--1.7.sql
+++ b/contrib/pg_buffercache/pg_buffercache--1.6--1.7.sql
@@ -23,3 +23,22 @@ REVOKE ALL ON pg_buffercache_partitions FROM PUBLIC;
GRANT EXECUTE ON FUNCTION pg_buffercache_partitions() TO pg_monitor;
GRANT SELECT ON pg_buffercache_partitions TO pg_monitor;
+
+-- Register the new functions.
+CREATE OR REPLACE FUNCTION pg_buffercache_pgproc()
+RETURNS SETOF RECORD
+AS 'MODULE_PATHNAME', 'pg_buffercache_pgproc'
+LANGUAGE C PARALLEL SAFE;
+
+-- Create a view for convenient access.
+CREATE VIEW pg_buffercache_pgproc AS
+ SELECT P.* FROM pg_buffercache_pgproc() AS P
+ (partition integer,
+ numa_node integer, num_procs integer, pgproc_ptr bigint, fastpath_ptr bigint);
+
+-- Don't want these to be available to public.
+REVOKE ALL ON FUNCTION pg_buffercache_pgproc() FROM PUBLIC;
+REVOKE ALL ON pg_buffercache_pgproc FROM PUBLIC;
+
+GRANT EXECUTE ON FUNCTION pg_buffercache_pgproc() TO pg_monitor;
+GRANT SELECT ON pg_buffercache_pgproc TO pg_monitor;
diff --git a/contrib/pg_buffercache/pg_buffercache_pages.c b/contrib/pg_buffercache/pg_buffercache_pages.c
index 5169655ae78..22396f36c09 100644
--- a/contrib/pg_buffercache/pg_buffercache_pages.c
+++ b/contrib/pg_buffercache/pg_buffercache_pages.c
@@ -15,6 +15,7 @@
#include "port/pg_numa.h"
#include "storage/buf_internals.h"
#include "storage/bufmgr.h"
+#include "storage/proc.h"
#include "utils/rel.h"
@@ -28,6 +29,7 @@
#define NUM_BUFFERCACHE_NUMA_ELEM 3
#define NUM_BUFFERCACHE_PARTITIONS_ELEM 11
+#define NUM_BUFFERCACHE_PGPROC_ELEM 5
PG_MODULE_MAGIC_EXT(
.name = "pg_buffercache",
@@ -102,6 +104,7 @@ PG_FUNCTION_INFO_V1(pg_buffercache_evict);
PG_FUNCTION_INFO_V1(pg_buffercache_evict_relation);
PG_FUNCTION_INFO_V1(pg_buffercache_evict_all);
PG_FUNCTION_INFO_V1(pg_buffercache_partitions);
+PG_FUNCTION_INFO_V1(pg_buffercache_pgproc);
/* Only need to touch memory once per backend process lifetime */
@@ -905,3 +908,94 @@ pg_buffercache_partitions(PG_FUNCTION_ARGS)
else
SRF_RETURN_DONE(funcctx);
}
+
+/*
+ * Inquire about partitioning of PGPROC array.
+ */
+Datum
+pg_buffercache_pgproc(PG_FUNCTION_ARGS)
+{
+ FuncCallContext *funcctx;
+ MemoryContext oldcontext;
+ TupleDesc tupledesc;
+ TupleDesc expected_tupledesc;
+ HeapTuple tuple;
+ Datum result;
+
+ if (SRF_IS_FIRSTCALL())
+ {
+ funcctx = SRF_FIRSTCALL_INIT();
+
+ /* Switch context when allocating stuff to be used in later calls */
+ oldcontext = MemoryContextSwitchTo(funcctx->multi_call_memory_ctx);
+
+ if (get_call_result_type(fcinfo, NULL, &expected_tupledesc) != TYPEFUNC_COMPOSITE)
+ elog(ERROR, "return type must be a row type");
+
+ if (expected_tupledesc->natts != NUM_BUFFERCACHE_PGPROC_ELEM)
+ elog(ERROR, "incorrect number of output arguments");
+
+ /* Construct a tuple descriptor for the result rows. */
+ tupledesc = CreateTemplateTupleDesc(expected_tupledesc->natts);
+ TupleDescInitEntry(tupledesc, (AttrNumber) 1, "partition",
+ INT4OID, -1, 0);
+ TupleDescInitEntry(tupledesc, (AttrNumber) 2, "numa_node",
+ INT4OID, -1, 0);
+ TupleDescInitEntry(tupledesc, (AttrNumber) 3, "num_procs",
+ INT4OID, -1, 0);
+ TupleDescInitEntry(tupledesc, (AttrNumber) 4, "pgproc_ptr",
+ INT8OID, -1, 0);
+ TupleDescInitEntry(tupledesc, (AttrNumber) 5, "fastpath_ptr",
+ INT8OID, -1, 0);
+
+ funcctx->user_fctx = BlessTupleDesc(tupledesc);
+
+ /* Return to original context when allocating transient memory */
+ MemoryContextSwitchTo(oldcontext);
+
+ /* Set max calls and remember the user function context. */
+ funcctx->max_calls = ProcPartitionCount();
+ }
+
+ funcctx = SRF_PERCALL_SETUP();
+
+ if (funcctx->call_cntr < funcctx->max_calls)
+ {
+ uint32 i = funcctx->call_cntr;
+
+ int numa_node,
+ num_procs;
+
+ void *pgproc_ptr,
+ *fastpath_ptr;
+
+ Datum values[NUM_BUFFERCACHE_PGPROC_ELEM];
+ bool nulls[NUM_BUFFERCACHE_PGPROC_ELEM];
+
+ ProcPartitionGet(i, &numa_node, &num_procs,
+ &pgproc_ptr, &fastpath_ptr);
+
+ values[0] = Int32GetDatum(i);
+ nulls[0] = false;
+
+ values[1] = Int32GetDatum(numa_node);
+ nulls[1] = false;
+
+ values[2] = Int32GetDatum(num_procs);
+ nulls[2] = false;
+
+ values[3] = PointerGetDatum(pgproc_ptr);
+ nulls[3] = false;
+
+ values[4] = PointerGetDatum(fastpath_ptr);
+ nulls[4] = false;
+
+ /* Build and return the tuple. */
+ tuple = heap_form_tuple((TupleDesc) funcctx->user_fctx, values, nulls);
+ result = HeapTupleGetDatum(tuple);
+
+ SRF_RETURN_NEXT(funcctx, result);
+ }
+ else
+ SRF_RETURN_DONE(funcctx);
+}
diff --git a/src/backend/access/transam/clog.c b/src/backend/access/transam/clog.c
index e80fbe109cf..928d126d0ee 100644
--- a/src/backend/access/transam/clog.c
+++ b/src/backend/access/transam/clog.c
@@ -574,7 +574,7 @@ TransactionGroupUpdateXidStatus(TransactionId xid, XidStatus status,
/* Walk the list and update the status of all XIDs. */
while (nextidx != INVALID_PROC_NUMBER)
{
- PGPROC *nextproc = &ProcGlobal->allProcs[nextidx];
+ PGPROC *nextproc = ProcGlobal->allProcs[nextidx];
int64 thispageno = nextproc->clogGroupMemberPage;
/*
@@ -633,7 +633,7 @@ TransactionGroupUpdateXidStatus(TransactionId xid, XidStatus status,
*/
while (wakeidx != INVALID_PROC_NUMBER)
{
- PGPROC *wakeproc = &ProcGlobal->allProcs[wakeidx];
+ PGPROC *wakeproc = ProcGlobal->allProcs[wakeidx];
wakeidx = pg_atomic_read_u32(&wakeproc->clogGroupNext);
pg_atomic_write_u32(&wakeproc->clogGroupNext, INVALID_PROC_NUMBER);
diff --git a/src/backend/postmaster/pgarch.c b/src/backend/postmaster/pgarch.c
index 78e39e5f866..e28e0f7d3bd 100644
--- a/src/backend/postmaster/pgarch.c
+++ b/src/backend/postmaster/pgarch.c
@@ -289,7 +289,7 @@ PgArchWakeup(void)
* be relaunched shortly and will start archiving.
*/
if (arch_pgprocno != INVALID_PROC_NUMBER)
- SetLatch(&ProcGlobal->allProcs[arch_pgprocno].procLatch);
+ SetLatch(&ProcGlobal->allProcs[arch_pgprocno]->procLatch);
}
diff --git a/src/backend/postmaster/walsummarizer.c b/src/backend/postmaster/walsummarizer.c
index 777c9a8d555..087279a6a8e 100644
--- a/src/backend/postmaster/walsummarizer.c
+++ b/src/backend/postmaster/walsummarizer.c
@@ -649,7 +649,7 @@ WakeupWalSummarizer(void)
LWLockRelease(WALSummarizerLock);
if (pgprocno != INVALID_PROC_NUMBER)
- SetLatch(&ProcGlobal->allProcs[pgprocno].procLatch);
+ SetLatch(&ProcGlobal->allProcs[pgprocno]->procLatch);
}
/*
diff --git a/src/backend/storage/buffer/buf_init.c b/src/backend/storage/buffer/buf_init.c
index 5b65a855b29..fb52039e1a6 100644
--- a/src/backend/storage/buffer/buf_init.c
+++ b/src/backend/storage/buffer/buf_init.c
@@ -500,8 +500,6 @@ buffer_partitions_prepare(void)
if (numa_nodes < 1)
numa_nodes = 1;
- elog(WARNING, "IsUnderPostmaster %d", IsUnderPostmaster);
-
/*
* XXX A bit weird. Do we need to worry about postmaster? Could this even
* run outside postmaster? I don't think so.
diff --git a/src/backend/storage/buffer/freelist.c b/src/backend/storage/buffer/freelist.c
index ff02dc8e00b..d8d602f0a4e 100644
--- a/src/backend/storage/buffer/freelist.c
+++ b/src/backend/storage/buffer/freelist.c
@@ -416,7 +416,7 @@ StrategyGetBuffer(BufferAccessStrategy strategy, uint32 *buf_state, bool *from_r
* actually fine because procLatch isn't ever freed, so we just can
* potentially set the wrong process' (or no process') latch.
*/
- SetLatch(&ProcGlobal->allProcs[bgwprocno].procLatch);
+ SetLatch(&ProcGlobal->allProcs[bgwprocno]->procLatch);
}
/*
diff --git a/src/backend/storage/ipc/procarray.c b/src/backend/storage/ipc/procarray.c
index bf987aed8d3..3e86e4ca2ae 100644
--- a/src/backend/storage/ipc/procarray.c
+++ b/src/backend/storage/ipc/procarray.c
@@ -268,7 +268,7 @@ typedef enum KAXCompressReason
static ProcArrayStruct *procArray;
-static PGPROC *allProcs;
+static PGPROC **allProcs;
/*
* Cache to reduce overhead of repeated calls to TransactionIdIsInProgress()
@@ -502,7 +502,7 @@ ProcArrayAdd(PGPROC *proc)
int this_procno = arrayP->pgprocnos[index];
Assert(this_procno >= 0 && this_procno < (arrayP->maxProcs + NUM_AUXILIARY_PROCS));
- Assert(allProcs[this_procno].pgxactoff == index);
+ Assert(allProcs[this_procno]->pgxactoff == index);
/* If we have found our right position in the array, break */
if (this_procno > pgprocno)
@@ -538,9 +538,9 @@ ProcArrayAdd(PGPROC *proc)
int procno = arrayP->pgprocnos[index];
Assert(procno >= 0 && procno < (arrayP->maxProcs + NUM_AUXILIARY_PROCS));
- Assert(allProcs[procno].pgxactoff == index - 1);
+ Assert(allProcs[procno]->pgxactoff == index - 1);
- allProcs[procno].pgxactoff = index;
+ allProcs[procno]->pgxactoff = index;
}
/*
@@ -581,7 +581,7 @@ ProcArrayRemove(PGPROC *proc, TransactionId latestXid)
myoff = proc->pgxactoff;
Assert(myoff >= 0 && myoff < arrayP->numProcs);
- Assert(ProcGlobal->allProcs[arrayP->pgprocnos[myoff]].pgxactoff == myoff);
+ Assert(ProcGlobal->allProcs[arrayP->pgprocnos[myoff]]->pgxactoff == myoff);
if (TransactionIdIsValid(latestXid))
{
@@ -636,9 +636,9 @@ ProcArrayRemove(PGPROC *proc, TransactionId latestXid)
int procno = arrayP->pgprocnos[index];
Assert(procno >= 0 && procno < (arrayP->maxProcs + NUM_AUXILIARY_PROCS));
- Assert(allProcs[procno].pgxactoff - 1 == index);
+ Assert(allProcs[procno]->pgxactoff - 1 == index);
- allProcs[procno].pgxactoff = index;
+ allProcs[procno]->pgxactoff = index;
}
/*
@@ -860,7 +860,7 @@ ProcArrayGroupClearXid(PGPROC *proc, TransactionId latestXid)
/* Walk the list and clear all XIDs. */
while (nextidx != INVALID_PROC_NUMBER)
{
- PGPROC *nextproc = &allProcs[nextidx];
+ PGPROC *nextproc = allProcs[nextidx];
ProcArrayEndTransactionInternal(nextproc, nextproc->procArrayGroupMemberXid);
@@ -880,7 +880,7 @@ ProcArrayGroupClearXid(PGPROC *proc, TransactionId latestXid)
*/
while (wakeidx != INVALID_PROC_NUMBER)
{
- PGPROC *nextproc = &allProcs[wakeidx];
+ PGPROC *nextproc = allProcs[wakeidx];
wakeidx = pg_atomic_read_u32(&nextproc->procArrayGroupNext);
pg_atomic_write_u32(&nextproc->procArrayGroupNext, INVALID_PROC_NUMBER);
@@ -1526,7 +1526,7 @@ TransactionIdIsInProgress(TransactionId xid)
pxids = other_subxidstates[pgxactoff].count;
pg_read_barrier(); /* pairs with barrier in GetNewTransactionId() */
pgprocno = arrayP->pgprocnos[pgxactoff];
- proc = &allProcs[pgprocno];
+ proc = allProcs[pgprocno];
for (j = pxids - 1; j >= 0; j--)
{
/* Fetch xid just once - see GetNewTransactionId */
@@ -1622,7 +1622,6 @@ TransactionIdIsInProgress(TransactionId xid)
return false;
}
-
/*
* Determine XID horizons.
*
@@ -1740,7 +1739,7 @@ ComputeXidHorizons(ComputeXidHorizonsResult *h)
for (int index = 0; index < arrayP->numProcs; index++)
{
int pgprocno = arrayP->pgprocnos[index];
- PGPROC *proc = &allProcs[pgprocno];
+ PGPROC *proc = allProcs[pgprocno];
int8 statusFlags = ProcGlobal->statusFlags[index];
TransactionId xid;
TransactionId xmin;
@@ -2224,7 +2223,7 @@ GetSnapshotData(Snapshot snapshot)
TransactionId xid = UINT32_ACCESS_ONCE(other_xids[pgxactoff]);
uint8 statusFlags;
- Assert(allProcs[arrayP->pgprocnos[pgxactoff]].pgxactoff == pgxactoff);
+ Assert(allProcs[arrayP->pgprocnos[pgxactoff]]->pgxactoff == pgxactoff);
/*
* If the transaction has no XID assigned, we can skip it; it
@@ -2298,7 +2297,7 @@ GetSnapshotData(Snapshot snapshot)
if (nsubxids > 0)
{
int pgprocno = pgprocnos[pgxactoff];
- PGPROC *proc = &allProcs[pgprocno];
+ PGPROC *proc = allProcs[pgprocno];
pg_read_barrier(); /* pairs with GetNewTransactionId */
@@ -2499,7 +2498,7 @@ ProcArrayInstallImportedXmin(TransactionId xmin,
for (index = 0; index < arrayP->numProcs; index++)
{
int pgprocno = arrayP->pgprocnos[index];
- PGPROC *proc = &allProcs[pgprocno];
+ PGPROC *proc = allProcs[pgprocno];
int statusFlags = ProcGlobal->statusFlags[index];
TransactionId xid;
@@ -2725,7 +2724,7 @@ GetRunningTransactionData(void)
if (TransactionIdPrecedes(xid, oldestDatabaseRunningXid))
{
int pgprocno = arrayP->pgprocnos[index];
- PGPROC *proc = &allProcs[pgprocno];
+ PGPROC *proc = allProcs[pgprocno];
if (proc->databaseId == MyDatabaseId)
oldestDatabaseRunningXid = xid;
@@ -2756,7 +2755,7 @@ GetRunningTransactionData(void)
for (index = 0; index < arrayP->numProcs; index++)
{
int pgprocno = arrayP->pgprocnos[index];
- PGPROC *proc = &allProcs[pgprocno];
+ PGPROC *proc = allProcs[pgprocno];
int nsubxids;
/*
@@ -2858,7 +2857,7 @@ GetOldestActiveTransactionId(bool inCommitOnly, bool allDbs)
{
TransactionId xid;
int pgprocno = arrayP->pgprocnos[index];
- PGPROC *proc = &allProcs[pgprocno];
+ PGPROC *proc = allProcs[pgprocno];
/* Fetch xid just once - see GetNewTransactionId */
xid = UINT32_ACCESS_ONCE(other_xids[index]);
@@ -3020,7 +3019,7 @@ GetVirtualXIDsDelayingChkpt(int *nvxids, int type)
for (index = 0; index < arrayP->numProcs; index++)
{
int pgprocno = arrayP->pgprocnos[index];
- PGPROC *proc = &allProcs[pgprocno];
+ PGPROC *proc = allProcs[pgprocno];
if ((proc->delayChkptFlags & type) != 0)
{
@@ -3061,7 +3060,7 @@ HaveVirtualXIDsDelayingChkpt(VirtualTransactionId *vxids, int nvxids, int type)
for (index = 0; index < arrayP->numProcs; index++)
{
int pgprocno = arrayP->pgprocnos[index];
- PGPROC *proc = &allProcs[pgprocno];
+ PGPROC *proc = allProcs[pgprocno];
VirtualTransactionId vxid;
GET_VXID_FROM_PGPROC(vxid, *proc);
@@ -3189,7 +3188,7 @@ BackendPidGetProcWithLock(int pid)
for (index = 0; index < arrayP->numProcs; index++)
{
- PGPROC *proc = &allProcs[arrayP->pgprocnos[index]];
+ PGPROC *proc = allProcs[arrayP->pgprocnos[index]];
if (proc->pid == pid)
{
@@ -3232,7 +3231,7 @@ BackendXidGetPid(TransactionId xid)
if (other_xids[index] == xid)
{
int pgprocno = arrayP->pgprocnos[index];
- PGPROC *proc = &allProcs[pgprocno];
+ PGPROC *proc = allProcs[pgprocno];
result = proc->pid;
break;
@@ -3301,7 +3300,7 @@ GetCurrentVirtualXIDs(TransactionId limitXmin, bool excludeXmin0,
for (index = 0; index < arrayP->numProcs; index++)
{
int pgprocno = arrayP->pgprocnos[index];
- PGPROC *proc = &allProcs[pgprocno];
+ PGPROC *proc = allProcs[pgprocno];
uint8 statusFlags = ProcGlobal->statusFlags[index];
if (proc == MyProc)
@@ -3403,7 +3402,7 @@ GetConflictingVirtualXIDs(TransactionId limitXmin, Oid dbOid)
for (index = 0; index < arrayP->numProcs; index++)
{
int pgprocno = arrayP->pgprocnos[index];
- PGPROC *proc = &allProcs[pgprocno];
+ PGPROC *proc = allProcs[pgprocno];
/* Exclude prepared transactions */
if (proc->pid == 0)
@@ -3468,7 +3467,7 @@ SignalVirtualTransaction(VirtualTransactionId vxid, ProcSignalReason sigmode,
for (index = 0; index < arrayP->numProcs; index++)
{
int pgprocno = arrayP->pgprocnos[index];
- PGPROC *proc = &allProcs[pgprocno];
+ PGPROC *proc = allProcs[pgprocno];
VirtualTransactionId procvxid;
GET_VXID_FROM_PGPROC(procvxid, *proc);
@@ -3523,7 +3522,7 @@ MinimumActiveBackends(int min)
for (index = 0; index < arrayP->numProcs; index++)
{
int pgprocno = arrayP->pgprocnos[index];
- PGPROC *proc = &allProcs[pgprocno];
+ PGPROC *proc = allProcs[pgprocno];
/*
* Since we're not holding a lock, need to be prepared to deal with
@@ -3569,7 +3568,7 @@ CountDBBackends(Oid databaseid)
for (index = 0; index < arrayP->numProcs; index++)
{
int pgprocno = arrayP->pgprocnos[index];
- PGPROC *proc = &allProcs[pgprocno];
+ PGPROC *proc = allProcs[pgprocno];
if (proc->pid == 0)
continue; /* do not count prepared xacts */
@@ -3598,7 +3597,7 @@ CountDBConnections(Oid databaseid)
for (index = 0; index < arrayP->numProcs; index++)
{
int pgprocno = arrayP->pgprocnos[index];
- PGPROC *proc = &allProcs[pgprocno];
+ PGPROC *proc = allProcs[pgprocno];
if (proc->pid == 0)
continue; /* do not count prepared xacts */
@@ -3629,7 +3628,7 @@ CancelDBBackends(Oid databaseid, ProcSignalReason sigmode, bool conflictPending)
for (index = 0; index < arrayP->numProcs; index++)
{
int pgprocno = arrayP->pgprocnos[index];
- PGPROC *proc = &allProcs[pgprocno];
+ PGPROC *proc = allProcs[pgprocno];
if (databaseid == InvalidOid || proc->databaseId == databaseid)
{
@@ -3670,7 +3669,7 @@ CountUserBackends(Oid roleid)
for (index = 0; index < arrayP->numProcs; index++)
{
int pgprocno = arrayP->pgprocnos[index];
- PGPROC *proc = &allProcs[pgprocno];
+ PGPROC *proc = allProcs[pgprocno];
if (proc->pid == 0)
continue; /* do not count prepared xacts */
@@ -3733,7 +3732,7 @@ CountOtherDBBackends(Oid databaseId, int *nbackends, int *nprepared)
for (index = 0; index < arrayP->numProcs; index++)
{
int pgprocno = arrayP->pgprocnos[index];
- PGPROC *proc = &allProcs[pgprocno];
+ PGPROC *proc = allProcs[pgprocno];
uint8 statusFlags = ProcGlobal->statusFlags[index];
if (proc->databaseId != databaseId)
@@ -3799,7 +3798,7 @@ TerminateOtherDBBackends(Oid databaseId)
for (i = 0; i < procArray->numProcs; i++)
{
int pgprocno = arrayP->pgprocnos[i];
- PGPROC *proc = &allProcs[pgprocno];
+ PGPROC *proc = allProcs[pgprocno];
if (proc->databaseId != databaseId)
continue;
diff --git a/src/backend/storage/lmgr/lock.c b/src/backend/storage/lmgr/lock.c
index 62f3471448e..c84a2a5f1bc 100644
--- a/src/backend/storage/lmgr/lock.c
+++ b/src/backend/storage/lmgr/lock.c
@@ -2844,7 +2844,7 @@ FastPathTransferRelationLocks(LockMethod lockMethodTable, const LOCKTAG *locktag
*/
for (i = 0; i < ProcGlobal->allProcCount; i++)
{
- PGPROC *proc = &ProcGlobal->allProcs[i];
+ PGPROC *proc = ProcGlobal->allProcs[i];
uint32 j;
LWLockAcquire(&proc->fpInfoLock, LW_EXCLUSIVE);
@@ -3103,7 +3103,7 @@ GetLockConflicts(const LOCKTAG *locktag, LOCKMODE lockmode, int *countp)
*/
for (i = 0; i < ProcGlobal->allProcCount; i++)
{
- PGPROC *proc = &ProcGlobal->allProcs[i];
+ PGPROC *proc = ProcGlobal->allProcs[i];
uint32 j;
/* A backend never blocks itself */
@@ -3790,7 +3790,7 @@ GetLockStatusData(void)
*/
for (i = 0; i < ProcGlobal->allProcCount; ++i)
{
- PGPROC *proc = &ProcGlobal->allProcs[i];
+ PGPROC *proc = ProcGlobal->allProcs[i];
/* Skip backends with pid=0, as they don't hold fast-path locks */
if (proc->pid == 0)
diff --git a/src/backend/storage/lmgr/proc.c b/src/backend/storage/lmgr/proc.c
index e9ef0fbfe32..11259151a7d 100644
--- a/src/backend/storage/lmgr/proc.c
+++ b/src/backend/storage/lmgr/proc.c
@@ -29,21 +29,29 @@
*/
#include "postgres.h"
+#include <sched.h>
#include <signal.h>
#include <unistd.h>
#include <sys/time.h>
+#ifdef USE_LIBNUMA
+#include <numa.h>
+#include <numaif.h>
+#endif
+
#include "access/transam.h"
#include "access/twophase.h"
#include "access/xlogutils.h"
#include "miscadmin.h"
#include "pgstat.h"
+#include "port/pg_numa.h"
#include "postmaster/autovacuum.h"
#include "replication/slotsync.h"
#include "replication/syncrep.h"
#include "storage/condition_variable.h"
#include "storage/ipc.h"
#include "storage/lmgr.h"
+#include "storage/pg_shmem.h"
#include "storage/pmsignal.h"
#include "storage/proc.h"
#include "storage/procarray.h"
@@ -90,6 +98,31 @@ static void AuxiliaryProcKill(int code, Datum arg);
static void CheckDeadLock(void);
+/* number of NUMA nodes (as returned by numa_num_configured_nodes) */
+static int numa_nodes = -1; /* number of nodes when sizing */
+static Size numa_page_size = 0; /* page used to size partitions */
+static bool numa_can_partition = false; /* can map to NUMA nodes? */
+static int numa_procs_per_node = -1; /* pgprocs per node */
+
+static Size get_memory_page_size(void); /* XXX duplicate with buf_init.c */
+
+static void pgproc_partitions_prepare(void);
+static char *pgproc_partition_init(char *ptr, int num_procs,
+ int allprocs_index, int node);
+static char *fastpath_partition_init(char *ptr, int num_procs,
+ int allprocs_index, int node,
+ Size fpLockBitsSize, Size fpRelIdSize);
+
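+/*
+ * PGProcPartition describes one PGPROC partition: which NUMA node it is
+ * bound to (-1 for the auxiliary/2PC partition), how many PGPROC entries
+ * it holds, and where its PGPROC and fast-path chunks start in shared
+ * memory.
+ */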
+typedef struct PGProcPartition
+{
+ int num_procs;
+ int numa_node;
+ void *pgproc_ptr;
+ void *fastpath_ptr;
+} PGProcPartition;
+
+static PGProcPartition *partitions = NULL;
+
/*
* Report shared-memory space needed by PGPROC.
*/
@@ -100,11 +133,63 @@ PGProcShmemSize(void)
Size TotalProcs =
add_size(MaxBackends, add_size(NUM_AUXILIARY_PROCS, max_prepared_xacts));
+ size = add_size(size, mul_size(TotalProcs, sizeof(PGPROC *)));
size = add_size(size, mul_size(TotalProcs, sizeof(PGPROC)));
size = add_size(size, mul_size(TotalProcs, sizeof(*ProcGlobal->xids)));
size = add_size(size, mul_size(TotalProcs, sizeof(*ProcGlobal->subxidStates)));
size = add_size(size, mul_size(TotalProcs, sizeof(*ProcGlobal->statusFlags)));
+ /*
+ * To support NUMA partitioning, the PGPROC array will be divided into
+ * multiple chunks - one per NUMA node, and one extra for auxiliary/2PC
+ * entries (which are not assigned to any NUMA node).
+ *
+ * We can't simply map pages of a single contiguous array, because the
+ * PGPROC entries are very small and too many of them would fit on a
+ * single page (at least with huge pages) - far more than any reasonable
+ * value of max_connections. So instead we cut the array into separate
+ * pieces for each node.
+ *
+ * Each piece may need up to one memory page of padding, to align it to a
+ * memory page boundary (for NUMA), so we just add a page - it's a bit
+ * wasteful, but should not matter much - NUMA is meant for large boxes,
+ * so a couple pages is negligible.
+ *
+ * We only do this with NUMA partitioning. With the GUC disabled, or when
+ * we find we can't do that for some reason, we just allocate the PGPROC
+ * array as a single chunk. This is determined by the earlier call to
+ * pgproc_partitions_prepare().
+ *
+ * XXX It might be more painful with very large huge pages (e.g. 1GB).
+ */
+
+ /*
+ * If PGPROC partitioning is enabled, and we decided it's possible, we
+ * need to add one memory page per NUMA node (and one for auxiliary/2PC
+ * processes) to allow proper alignment.
+ *
+ * XXX This is a bit wasteful, because it might actually add pages even
+ * when not strictly needed (if it's already aligned). And we always
+ * assume we'll add a whole page, even if the alignment needs only less
+ * memory.
+ */
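+ /*
+ * For example, with 4 NUMA nodes and 2MB huge pages this adds 5 extra
+ * memory pages (10MB) for alignment, plus 5 PGProcPartition entries for
+ * the registry.
+ */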
+ if (numa_procs_interleave && numa_can_partition)
+ {
+ Assert(numa_nodes > 0);
+ size = add_size(size, mul_size((numa_nodes + 1), numa_page_size));
+
+ /*
+ * Also account for a small registry of partitions, a simple array of
+ * partitions at the beginning.
+ */
+ size = add_size(size, mul_size((numa_nodes + 1), sizeof(PGProcPartition)));
+ }
+ else
+ {
+ /* otherwise add only a tiny registry, with a single partition */
+ size = add_size(size, sizeof(PGProcPartition));
+ }
+
return size;
}
@@ -129,6 +214,25 @@ FastPathLockShmemSize(void)
size = add_size(size, mul_size(TotalProcs, (fpLockBitsSize + fpRelIdSize)));
+ /*
+ * When applying NUMA to the fast-path locks, we follow the same logic as
+ * for PGPROC entries. See the comments in PGProcShmemSize().
+ *
+ * If PGPROC partitioning is enabled, and we decided it's possible, we
+ * need to add one memory page per NUMA node (and one for auxiliary/2PC
+ * processes) to allow proper alignment.
+ *
+ * XXX This is a bit wasteful, because it might actually add pages even
+ * when not strictly needed (if it's already aligned). And we always
+ * assume we'll add a whole page, even if the alignment needs only less
+ * memory.
+ */
+ if (numa_procs_interleave && numa_can_partition)
+ {
+ Assert(numa_nodes > 0);
+ size = add_size(size, mul_size((numa_nodes + 1), numa_page_size));
+ }
+
return size;
}
@@ -140,6 +244,9 @@ ProcGlobalShmemSize(void)
{
Size size = 0;
+ /* calculate partition info for pgproc entries etc */
+ pgproc_partitions_prepare();
+
/* ProcGlobal */
size = add_size(size, sizeof(PROC_HDR));
size = add_size(size, sizeof(slock_t));
@@ -191,7 +298,7 @@ ProcGlobalSemas(void)
void
InitProcGlobal(void)
{
- PGPROC *procs;
+ PGPROC **procs;
int i,
j;
bool found;
@@ -205,6 +312,8 @@ InitProcGlobal(void)
Size requestSize;
char *ptr;
+ Size mem_page_size = get_memory_page_size();
+
/* Create the ProcGlobal shared structure */
ProcGlobal = (PROC_HDR *)
ShmemInitStruct("Proc Header", sizeof(PROC_HDR), &found);
@@ -241,19 +350,115 @@ InitProcGlobal(void)
MemSet(ptr, 0, requestSize);
- procs = (PGPROC *) ptr;
- ptr = (char *) ptr + TotalProcs * sizeof(PGPROC);
+ /* allprocs (array of pointers to PGPROC entries) */
+ procs = (PGPROC **) ptr;
+ ptr = (char *) ptr + TotalProcs * sizeof(PGPROC *);
ProcGlobal->allProcs = procs;
/* XXX allProcCount isn't really all of them; it excludes prepared xacts */
ProcGlobal->allProcCount = MaxBackends + NUM_AUXILIARY_PROCS;
+ /*
+ * If NUMA partitioning is enabled, and we decided we actually can do the
+ * partitioning, allocate the chunks.
+ *
+ * Otherwise we'll allocate a single array for everything. It's not quite
+ * what we did without NUMA, because there's an extra level of
+ * indirection, but it's the best we can do.
+ */
+ if (numa_procs_interleave && numa_can_partition)
+ {
+ int node_procs;
+ int total_procs = 0;
+
+ Assert(numa_procs_per_node > 0);
+ Assert(numa_nodes > 0);
+
+ /*
+ * Now initialize the PGPROC partition registry, with one partition
+ * per NUMA node plus one for auxiliary procs / prepared xacts.
+ */
+ partitions = (PGProcPartition *) ptr;
+ ptr += ((numa_nodes + 1) * sizeof(PGProcPartition));
+
+ /* build PGPROC entries for NUMA nodes */
+ for (i = 0; i < numa_nodes; i++)
+ {
+ /* the last NUMA node may get fewer PGPROC entries, but meh */
+ node_procs = Min(numa_procs_per_node, MaxBackends - total_procs);
+
+ /* make sure to align the PGPROC array to memory page */
+ ptr = (char *) TYPEALIGN(numa_page_size, ptr);
+
+ /* fill in the partition info */
+ partitions[i].num_procs = node_procs;
+ partitions[i].numa_node = i;
+ partitions[i].pgproc_ptr = ptr;
+
+ ptr = pgproc_partition_init(ptr, node_procs, total_procs, i);
+
+ total_procs += node_procs;
+
+ /* don't underflow/overflow the allocation */
+ Assert((ptr > (char *) procs) && (ptr <= (char *) procs + requestSize));
+ }
+
+ Assert(total_procs == MaxBackends);
+
+ /*
+ * Also build PGPROC entries for auxiliary procs / prepared xacts (we
+ * however don't assign those to any NUMA node).
+ */
+ node_procs = (NUM_AUXILIARY_PROCS + max_prepared_xacts);
+
+ /* make sure to align the PGPROC array to memory page */
+ ptr = (char *) TYPEALIGN(numa_page_size, ptr);
+
+ /* fill in the partition info */
+ partitions[numa_nodes].num_procs = node_procs;
+ partitions[numa_nodes].numa_node = -1;
+ partitions[numa_nodes].pgproc_ptr = ptr;
+
+ ptr = pgproc_partition_init(ptr, node_procs, total_procs, -1);
+
+ total_procs += node_procs;
+
+ /* don't overflow the allocation */
+ Assert((ptr > (char *) procs) && (ptr <= (char *) procs + requestSize));
+
+ Assert(total_procs == TotalProcs);
+ }
+ else
+ {
+ /*
+ * Now initialize the PGPROC partition registry with a single
+ * partition for all the procs.
+ */
+ partitions = (PGProcPartition *) ptr;
+ ptr += sizeof(PGProcPartition);
+
+ /* fill in the partition info (ptr still points at the start) */
+ partitions[0].num_procs = TotalProcs;
+ partitions[0].numa_node = -1;
+ partitions[0].pgproc_ptr = ptr;
+
+ /* just treat everything as a single array, with no alignment */
+ ptr = pgproc_partition_init(ptr, TotalProcs, 0, -1);
+
+ /* don't overflow the allocation */
+ Assert((ptr > (char *) procs) && (ptr <= (char *) procs + requestSize));
+ }
+
/*
* Allocate arrays mirroring PGPROC fields in a dense manner. See
* PROC_HDR.
*
* XXX: It might make sense to increase padding for these arrays, given
* how hotly they are accessed.
+ *
+ * XXX Would it make sense to NUMA-partition these chunks too, somehow?
+ * But those arrays are tiny, fit into a single memory page, so would need
+ * to be made more complex. Not sure.
*/
ProcGlobal->xids = (TransactionId *) ptr;
ptr = (char *) ptr + (TotalProcs * sizeof(*ProcGlobal->xids));
@@ -286,24 +491,92 @@ InitProcGlobal(void)
/* For asserts checking we did not overflow. */
fpEndPtr = fpPtr + requestSize;
- for (i = 0; i < TotalProcs; i++)
+ /*
+ * Mimic the logic we used to partition PGPROC entries.
+ */
+
+ /*
+ * If NUMA partitioning is enabled, and we decided we actually can do the
+ * partitioning, allocate the chunks.
+ *
+ * Otherwise we'll allocate a single array for everything. It's not quite
+ * what we did without NUMA, because there's an extra level of
+ * indirection, but it's the best we can do.
+ */
+ if (numa_procs_interleave && numa_can_partition)
{
- PGPROC *proc = &procs[i];
+ int node_procs;
+ int total_procs = 0;
+
+ Assert(numa_procs_per_node > 0);
+
+ /* build PGPROC entries for NUMA nodes */
+ for (i = 0; i < numa_nodes; i++)
+ {
+ /* the last NUMA node may get fewer PGPROC entries, but meh */
+ node_procs = Min(numa_procs_per_node, MaxBackends - total_procs);
+
+ /* make sure to align the PGPROC array to memory page */
+ fpPtr = (char *) TYPEALIGN(mem_page_size, fpPtr);
- /* Common initialization for all PGPROCs, regardless of type. */
+ /* remember this pointer too */
+ partitions[i].fastpath_ptr = fpPtr;
+ Assert(node_procs == partitions[i].num_procs);
+
+ fpPtr = fastpath_partition_init(fpPtr, node_procs, total_procs, i,
+ fpLockBitsSize, fpRelIdSize);
+
+ total_procs += node_procs;
+
+ /* don't overflow the allocation */
+ Assert(fpPtr <= fpEndPtr);
+ }
+
+ Assert(total_procs == MaxBackends);
/*
- * Set the fast-path lock arrays, and move the pointer. We interleave
- * the two arrays, to (hopefully) get some locality for each backend.
+ * Also build PGPROC entries for auxiliary procs / prepared xacts (we
+ * however don't assign those to any NUMA node).
*/
- proc->fpLockBits = (uint64 *) fpPtr;
- fpPtr += fpLockBitsSize;
+ node_procs = (NUM_AUXILIARY_PROCS + max_prepared_xacts);
- proc->fpRelId = (Oid *) fpPtr;
- fpPtr += fpRelIdSize;
+ /* make sure to align the PGPROC array to memory page */
+ fpPtr = (char *) TYPEALIGN(numa_page_size, fpPtr);
+ /* remember this pointer too */
+ partitions[numa_nodes].fastpath_ptr = fpPtr;
+ Assert(node_procs == partitions[numa_nodes].num_procs);
+
+ fpPtr = fastpath_partition_init(fpPtr, node_procs, total_procs, -1,
+ fpLockBitsSize, fpRelIdSize);
+
+ total_procs += node_procs;
+
+ /* don't overflow the allocation */
Assert(fpPtr <= fpEndPtr);
+ Assert(total_procs == TotalProcs);
+ }
+ else
+ {
+ /* remember this pointer too */
+ partitions[0].fastpath_ptr = fpPtr;
+ Assert(TotalProcs == partitions[0].num_procs);
+
+ /* just treat everything as a single array, with no alignment */
+ fpPtr = fastpath_partition_init(fpPtr, TotalProcs, 0, -1,
+ fpLockBitsSize, fpRelIdSize);
+
+ /* don't overflow the allocation */
+ Assert(fpPtr <= fpEndPtr);
+ }
+
+ for (i = 0; i < TotalProcs; i++)
+ {
+ PGPROC *proc = procs[i];
+
+ Assert(proc->procnumber == i);
+
/*
* Set up per-PGPROC semaphore, latch, and fpInfoLock. Prepared xact
* dummy PGPROCs don't need these though - they're never associated
@@ -366,15 +639,12 @@ InitProcGlobal(void)
pg_atomic_init_u64(&(proc->waitStart), 0);
}
- /* Should have consumed exactly the expected amount of fast-path memory. */
- Assert(fpPtr == fpEndPtr);
-
/*
* Save pointers to the blocks of PGPROC structures reserved for auxiliary
* processes and prepared transactions.
*/
- AuxiliaryProcs = &procs[MaxBackends];
- PreparedXactProcs = &procs[MaxBackends + NUM_AUXILIARY_PROCS];
+ AuxiliaryProcs = procs[MaxBackends];
+ PreparedXactProcs = procs[MaxBackends + NUM_AUXILIARY_PROCS];
/* Create ProcStructLock spinlock, too */
ProcStructLock = (slock_t *) ShmemInitStruct("ProcStructLock spinlock",
@@ -435,7 +705,45 @@ InitProcess(void)
if (!dlist_is_empty(procgloballist))
{
- MyProc = dlist_container(PGPROC, links, dlist_pop_head_node(procgloballist));
+ /*
+ * With numa interleaving of PGPROC, try to get a PROC entry from the
+ * right NUMA node (when the process starts).
+ *
+ * XXX The process may move to a different NUMA node later, but
+ * there's not much we can do about that.
+ */
+ if (numa_procs_interleave)
+ {
+ dlist_mutable_iter iter;
+ unsigned cpu;
+ unsigned node;
+ int rc;
+
+ rc = getcpu(&cpu, &node);
+ if (rc != 0)
+ elog(ERROR, "getcpu failed: %m");
+
+ MyProc = NULL;
+
+ dlist_foreach_modify(iter, procgloballist)
+ {
+ PGPROC *proc;
+
+ proc = dlist_container(PGPROC, links, iter.cur);
+
+ if (proc->numa_node == node)
+ {
+ MyProc = proc;
+ dlist_delete(iter.cur);
+ break;
+ }
+ }
+ }
+
+ /* didn't find PGPROC from the correct NUMA node, pick any free one */
+ if (MyProc == NULL)
+ MyProc = dlist_container(PGPROC, links, dlist_pop_head_node(procgloballist));
+
SpinLockRelease(ProcStructLock);
}
else
@@ -1988,7 +2296,7 @@ ProcSendSignal(ProcNumber procNumber)
if (procNumber < 0 || procNumber >= ProcGlobal->allProcCount)
elog(ERROR, "procNumber out of range");
- SetLatch(&ProcGlobal->allProcs[procNumber].procLatch);
+ SetLatch(&ProcGlobal->allProcs[procNumber]->procLatch);
}
/*
@@ -2063,3 +2371,222 @@ BecomeLockGroupMember(PGPROC *leader, int pid)
return ok;
}
+
+/* copy from buf_init.c */
+static Size
+get_memory_page_size(void)
+{
+ Size os_page_size;
+ Size huge_page_size;
+
+#ifdef WIN32
+ SYSTEM_INFO sysinfo;
+
+ GetSystemInfo(&sysinfo);
+ os_page_size = sysinfo.dwPageSize;
+#else
+ os_page_size = sysconf(_SC_PAGESIZE);
+#endif
+
+ /*
+ * XXX This is a bit annoying/confusing, because we may get a different
+ * result depending on when we call it. Before mmap() we don't know if the
+ * huge pages get used, so we assume they will. And then if we don't get
+ * huge pages, we'll waste memory etc.
+ */
+
+ /* assume huge pages get used, unless HUGE_PAGES_OFF */
+ if (huge_pages_status == HUGE_PAGES_OFF)
+ huge_page_size = 0;
+ else
+ GetHugePageSize(&huge_page_size, NULL);
+
+ return Max(os_page_size, huge_page_size);
+}
+
+/*
+ * pgproc_partitions_prepare
+ *		Calculate parameters for partitioning the PGPROC array.
+ *
+ * The PGPROC array is split into one "chunk" per NUMA node (plus one extra
+ * chunk for auxiliary processes and 2PC transactions, not associated with
+ * any particular node).
+ *
+ * Determine how many "backend" procs to allocate per NUMA node. The count
+ * may not divide evenly, but we mostly ignore that. The last node may get
+ * somewhat fewer PGPROC entries, but the imbalance ought to be pretty
+ * small (if MaxBackends >> numa_nodes).
+ *
+ * XXX A fairer distribution is possible, but not worth it for now.
+ */
+static void
+pgproc_partitions_prepare(void)
+{
+ /* bail out if already initialized (calculate only once) */
+ if (numa_nodes != -1)
+ return;
+
+ /* XXX only gives us the number, the nodes may not be 0, 1, 2, ... */
+ numa_nodes = numa_num_configured_nodes();
+
+ /* XXX can this happen? */
+ if (numa_nodes < 1)
+ numa_nodes = 1;
+
+ /*
+ * XXX A bit weird. Do we need to worry about postmaster? Could this even
+ * run outside postmaster? I don't think so.
+ *
+ * XXX Another issue is we may get different values than when sizing the
+ * memory, because at that point we didn't know if we get huge pages,
+ * so we assumed we will. Shouldn't cause crashes, but we might allocate
+ * shared memory and then not use some of it (because of the alignment
+ * that we don't actually need). Not sure about better way, good for now.
+ */
+ if (IsUnderPostmaster)
+ numa_page_size = pg_get_shmem_pagesize();
+ else
+ numa_page_size = get_memory_page_size();
+
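+ /* ceiling division, e.g. MaxBackends = 100 on 3 nodes gives 34 per node */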
+ numa_procs_per_node = (MaxBackends + (numa_nodes - 1)) / numa_nodes;
+
+ elog(LOG, "NUMA: pgproc backends %d num_nodes %d per_node %d",
+ MaxBackends, numa_nodes, numa_procs_per_node);
+
+ Assert(numa_nodes * numa_procs_per_node >= MaxBackends);
+
+ /* success */
+ numa_can_partition = true;
+}
+
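+/*
+ * pg_numa_move_to_node
+ *		Bind the memory range [startptr, endptr) to the given NUMA node.
+ *
+ * Uses numa_tonode_memory(), and expects startptr to be aligned to a memory
+ * page. Callers do this before touching the memory, so the pages get
+ * faulted in on the requested node.
+ */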
+static void
+pg_numa_move_to_node(char *startptr, char *endptr, int node)
+{
+ Size mem_page_size;
+ Size sz;
+
+ /*
+ * Get the "actual" memory page size, not the one we used for sizing. We
+ * might have used huge page for sizing, but only get regular pages when
+ * allocating, so we must use the smaller pages here.
+ *
+ * XXX A bit weird. Do we need to worry about postmaster? Could this even
+ * run outside postmaster? I don't think so.
+ */
+ if (IsUnderPostmaster)
+ mem_page_size = pg_get_shmem_pagesize();
+ else
+ mem_page_size = get_memory_page_size();
+
+ Assert((int64) startptr % mem_page_size == 0);
+
+ sz = (endptr - startptr);
+ numa_tonode_memory(startptr, sz, node);
+}
+
+/*
+ * pgproc_partition_init
+ *		Initialize one partition of the PGPROC array, starting at ptr.
+ *
+ * Binds the chunk to the given NUMA node (unless node is -1), sets the
+ * numa_node and procnumber fields of each entry, and fills in the matching
+ * ProcGlobal->allProcs pointers. Does not do any alignment - the caller is
+ * expected to page-align ptr first.
+ */
+static char *
+pgproc_partition_init(char *ptr, int num_procs, int allprocs_index, int node)
+{
+ PGPROC *procs_node;
+
+ /* allocate the PGPROC chunk for this node */
+ procs_node = (PGPROC *) ptr;
+
+ /* pointer right after this array */
+ ptr = (char *) ptr + num_procs * sizeof(PGPROC);
+
+ elog(LOG, "NUMA: pgproc_init_partition procs %p endptr %p num_procs %d node %d",
+ procs_node, ptr, num_procs, node);
+
+ /*
+ * if node specified, move to node - do this before we start touching the
+ * memory, to make sure it's not mapped to any node yet
+ */
+ if (node != -1)
+ pg_numa_move_to_node((char *) procs_node, ptr, node);
+
+ /* add pointers to the PGPROC entries to allProcs */
+ for (int i = 0; i < num_procs; i++)
+ {
+ procs_node[i].numa_node = node;
+ procs_node[i].procnumber = allprocs_index;
+
+ ProcGlobal->allProcs[allprocs_index] = &procs_node[i];
+
+ allprocs_index++;
+ }
+
+ return ptr;
+}
+
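+/*
+ * fastpath_partition_init
+ *		Initialize the fast-path lock arrays for one PGPROC partition.
+ *
+ * Binds the memory range to the given NUMA node (unless node is -1) and
+ * points each PGPROC entry of the partition at its fpLockBits and fpRelId
+ * arrays, interleaved per backend.
+ */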
+static char *
+fastpath_partition_init(char *ptr, int num_procs, int allprocs_index, int node,
+ Size fpLockBitsSize, Size fpRelIdSize)
+{
+ char *endptr = ptr + num_procs * (fpLockBitsSize + fpRelIdSize);
+
+ /*
+ * if node specified, move to node - do this before we start touching the
+ * memory, to make sure it's not mapped to any node yet
+ */
+ if (node != -1)
+ pg_numa_move_to_node(ptr, endptr, node);
+
+ /*
+ * Now point the PGPROC entries of this partition at their fast-path
+ * arrays, advancing the pointer as we go.
+ */
+ for (int i = 0; i < num_procs; i++)
+ {
+ PGPROC *proc = ProcGlobal->allProcs[allprocs_index];
+
+ /* cross-check we got the expected NUMA node */
+ Assert(proc->numa_node == node);
+ Assert(proc->procnumber == allprocs_index);
+
+ /*
+ * Set the fast-path lock arrays, and move the pointer. We interleave
+ * the two arrays, to (hopefully) get some locality for each backend.
+ */
+ proc->fpLockBits = (uint64 *) ptr;
+ ptr += fpLockBitsSize;
+
+ proc->fpRelId = (Oid *) ptr;
+ ptr += fpRelIdSize;
+
+ Assert(ptr <= endptr);
+
+ allprocs_index++;
+ }
+
+ Assert(ptr == endptr);
+
+ return endptr;
+}
+
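+/*
+ * ProcPartitionCount
+ *		Return the number of PGPROC partitions - one per NUMA node plus one
+ *		for auxiliary procs / prepared xacts, or 1 without partitioning.
+ */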
+int
+ProcPartitionCount(void)
+{
+ if (numa_procs_interleave && numa_can_partition)
+ return (numa_nodes + 1);
+
+ return 1;
+}
+
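+/*
+ * ProcPartitionGet
+ *		Return the NUMA node, number of PGPROC entries, and the PGPROC and
+ *		fast-path chunk pointers for the given partition index.
+ */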
+void
+ProcPartitionGet(int idx, int *node, int *nprocs, void **procsptr, void **fpptr)
+{
+ PGProcPartition *part = &partitions[idx];
+
+ Assert((idx >= 0) && (idx < ProcPartitionCount()));
+
+ *nprocs = part->num_procs;
+ *procsptr = part->pgproc_ptr;
+ *fpptr = part->fastpath_ptr;
+ *node = part->numa_node;
+}
diff --git a/src/backend/utils/init/globals.c b/src/backend/utils/init/globals.c
index a11bc71a386..6ee4684d1b8 100644
--- a/src/backend/utils/init/globals.c
+++ b/src/backend/utils/init/globals.c
@@ -149,6 +149,7 @@ int MaxBackends = 0;
bool numa_buffers_interleave = false;
bool numa_localalloc = false;
bool numa_partition_freelist = false;
+bool numa_procs_interleave = false;
/* GUC parameters for vacuum */
int VacuumBufferUsageLimit = 2048;
diff --git a/src/backend/utils/misc/guc_tables.c b/src/backend/utils/misc/guc_tables.c
index 0552ed62cc7..7b718760248 100644
--- a/src/backend/utils/misc/guc_tables.c
+++ b/src/backend/utils/misc/guc_tables.c
@@ -2146,6 +2146,16 @@ struct config_bool ConfigureNamesBool[] =
NULL, NULL, NULL
},
+ {
+ {"numa_procs_interleave", PGC_POSTMASTER, DEVELOPER_OPTIONS,
+ gettext_noop("Enables NUMA interleaving of PGPROC entries."),
+ gettext_noop("When enabled, the PGPROC entries are interleaved across all NUMA nodes."),
+ },
+ &numa_procs_interleave,
+ false,
+ NULL, NULL, NULL
+ },
+
{
{"sync_replication_slots", PGC_SIGHUP, REPLICATION_STANDBY,
gettext_noop("Enables a physical standby to synchronize logical failover replication slots from the primary server."),
diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h
index 66baf2bf33e..cdeee8dccba 100644
--- a/src/include/miscadmin.h
+++ b/src/include/miscadmin.h
@@ -181,6 +181,7 @@ extern PGDLLIMPORT int max_parallel_workers;
extern PGDLLIMPORT bool numa_buffers_interleave;
extern PGDLLIMPORT bool numa_localalloc;
extern PGDLLIMPORT bool numa_partition_freelist;
+extern PGDLLIMPORT bool numa_procs_interleave;
extern PGDLLIMPORT int commit_timestamp_buffers;
extern PGDLLIMPORT int multixact_member_buffers;
diff --git a/src/include/storage/proc.h b/src/include/storage/proc.h
index c6f5ebceefd..d2d269941fc 100644
--- a/src/include/storage/proc.h
+++ b/src/include/storage/proc.h
@@ -202,6 +202,8 @@ struct PGPROC
* vacuum must not remove tuples deleted by
* xid >= xmin ! */
+ int procnumber; /* index in ProcGlobal->allProcs */
+
int pid; /* Backend's process ID; 0 if prepared xact */
int pgxactoff; /* offset into various ProcGlobal->arrays with
@@ -327,6 +329,9 @@ struct PGPROC
PGPROC *lockGroupLeader; /* lock group leader, if I'm a member */
dlist_head lockGroupMembers; /* list of members, if I'm a leader */
dlist_node lockGroupLink; /* my member link, if I'm a member */
+
+ /* NUMA node this PGPROC entry is assigned to, or -1 if not assigned */
+ int numa_node;
};
/* NOTE: "typedef struct PGPROC PGPROC" appears in storage/lock.h. */
@@ -391,7 +396,7 @@ extern PGDLLIMPORT PGPROC *MyProc;
typedef struct PROC_HDR
{
/* Array of PGPROC structures (not including dummies for prepared txns) */
- PGPROC *allProcs;
+ PGPROC **allProcs;
/* Array mirroring PGPROC.xid for each PGPROC currently in the procarray */
TransactionId *xids;
@@ -443,8 +448,8 @@ extern PGDLLIMPORT PGPROC *PreparedXactProcs;
/*
* Accessors for getting PGPROC given a ProcNumber and vice versa.
*/
-#define GetPGProcByNumber(n) (&ProcGlobal->allProcs[(n)])
-#define GetNumberFromPGProc(proc) ((proc) - &ProcGlobal->allProcs[0])
+#define GetPGProcByNumber(n) (ProcGlobal->allProcs[(n)])
+#define GetNumberFromPGProc(proc) ((proc)->procnumber)
/*
* We set aside some extra PGPROC structures for "special worker" processes,
@@ -520,4 +525,7 @@ extern PGPROC *AuxiliaryPidGetProc(int pid);
extern void BecomeLockGroupLeader(void);
extern bool BecomeLockGroupMember(PGPROC *leader, int pid);
+extern int ProcPartitionCount(void);
+extern void ProcPartitionGet(int idx, int *node, int *nprocs, void **procsptr, void **fpptr);
+
#endif /* _PROC_H_ */
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index a38dd8d6242..5595cd48eee 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -1877,6 +1877,7 @@ PGP_MPI
PGP_PubKey
PGP_S2K
PGPing
+PGProcPartition
PGQueryClass
PGRUsage
PGSemaphore
--
2.50.1
v3-0005-NUMA-clockweep-partitioning.patch
From 2bfd4a824b12e9a865c5ef0a8ed33e215fb1b698 Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tomas@vondra.me>
Date: Sun, 8 Jun 2025 18:53:12 +0200
Subject: [PATCH v3 5/7] NUMA: clockweep partitioning
Similar to the freelist patch - partition the "clocksweep" algorithm to
work on a sequence of smaller partitions, one by one.
It extends the "pg_buffercache_partitions" view to include information
about the clocksweep activity.
Note: This needs some sort of "balancing" when one of the partitions is
much busier than the rest (e.g. because there's a single backend consuming
a lot of buffers from it).
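For illustration, the per-partition clocksweep counters should be visible
with a simple query against the extended view, something like:

  SELECT partition, numa_node, num_buffers,
         complete_passes, buffer_allocs, next_victim_buffer
    FROM pg_buffercache_partitions;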
---
.../pg_buffercache--1.6--1.7.sql | 5 +-
contrib/pg_buffercache/pg_buffercache_pages.c | 24 +-
src/backend/storage/buffer/bufmgr.c | 476 ++++++++++--------
src/backend/storage/buffer/freelist.c | 224 +++++++--
src/include/storage/buf_internals.h | 4 +-
src/include/storage/bufmgr.h | 5 +-
src/tools/pgindent/typedefs.list | 1 +
7 files changed, 478 insertions(+), 261 deletions(-)
diff --git a/contrib/pg_buffercache/pg_buffercache--1.6--1.7.sql b/contrib/pg_buffercache/pg_buffercache--1.6--1.7.sql
index 3871c261528..b7d8ea45ed7 100644
--- a/contrib/pg_buffercache/pg_buffercache--1.6--1.7.sql
+++ b/contrib/pg_buffercache/pg_buffercache--1.6--1.7.sql
@@ -12,7 +12,10 @@ LANGUAGE C PARALLEL SAFE;
-- Create a view for convenient access.
CREATE VIEW pg_buffercache_partitions AS
SELECT P.* FROM pg_buffercache_partitions() AS P
- (partition integer, numa_node integer, num_buffers integer, first_buffer integer, last_buffer integer, buffers_consumed bigint, buffers_remain bigint, buffers_free bigint);
+ (partition integer,
+ numa_node integer, num_buffers integer, first_buffer integer, last_buffer integer,
+ buffers_consumed bigint, buffers_remain bigint, buffers_free bigint,
+ complete_passes bigint, buffer_allocs bigint, next_victim_buffer integer);
-- Don't want these to be available to public.
REVOKE ALL ON FUNCTION pg_buffercache_partitions() FROM PUBLIC;
diff --git a/contrib/pg_buffercache/pg_buffercache_pages.c b/contrib/pg_buffercache/pg_buffercache_pages.c
index 668ada8c47b..5169655ae78 100644
--- a/contrib/pg_buffercache/pg_buffercache_pages.c
+++ b/contrib/pg_buffercache/pg_buffercache_pages.c
@@ -27,7 +27,7 @@
#define NUM_BUFFERCACHE_EVICT_ALL_ELEM 3
#define NUM_BUFFERCACHE_NUMA_ELEM 3
-#define NUM_BUFFERCACHE_PARTITIONS_ELEM 8
+#define NUM_BUFFERCACHE_PARTITIONS_ELEM 11
PG_MODULE_MAGIC_EXT(
.name = "pg_buffercache",
@@ -818,6 +818,12 @@ pg_buffercache_partitions(PG_FUNCTION_ARGS)
INT8OID, -1, 0);
TupleDescInitEntry(tupledesc, (AttrNumber) 8, "buffers_free",
INT8OID, -1, 0);
+ TupleDescInitEntry(tupledesc, (AttrNumber) 9, "complete_passes",
+ INT8OID, -1, 0);
+ TupleDescInitEntry(tupledesc, (AttrNumber) 10, "buffer_allocs",
+ INT8OID, -1, 0);
+ TupleDescInitEntry(tupledesc, (AttrNumber) 11, "next_victim_buffer",
+ INT4OID, -1, 0);
funcctx->user_fctx = BlessTupleDesc(tupledesc);
@@ -843,6 +849,10 @@ pg_buffercache_partitions(PG_FUNCTION_ARGS)
buffers_remain,
buffers_free;
+ uint32 complete_passes,
+ buffer_allocs,
+ next_victim_buffer;
+
Datum values[NUM_BUFFERCACHE_PARTITIONS_ELEM];
bool nulls[NUM_BUFFERCACHE_PARTITIONS_ELEM];
@@ -850,7 +860,8 @@ pg_buffercache_partitions(PG_FUNCTION_ARGS)
&first_buffer, &last_buffer);
FreelistPartitionGetInfo(i, &buffers_consumed, &buffers_remain,
- &buffers_free);
+ &buffers_free, &complete_passes,
+ &buffer_allocs, &next_victim_buffer);
values[0] = Int32GetDatum(i);
nulls[0] = false;
@@ -876,6 +887,15 @@ pg_buffercache_partitions(PG_FUNCTION_ARGS)
values[7] = Int64GetDatum(buffers_free);
nulls[7] = false;
+ values[8] = Int64GetDatum(complete_passes);
+ nulls[8] = false;
+
+ values[9] = Int64GetDatum(buffer_allocs);
+ nulls[9] = false;
+
+ values[10] = Int32GetDatum(next_victim_buffer);
+ nulls[10] = false;
+
/* Build and return the tuple. */
tuple = heap_form_tuple((TupleDesc) funcctx->user_fctx, values, nulls);
result = HeapTupleGetDatum(tuple);
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 5922689fe5d..bd007c1c621 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -3587,6 +3587,23 @@ BufferSync(int flags)
TRACE_POSTGRESQL_BUFFER_SYNC_DONE(NBuffers, num_written, num_to_scan);
}
+/*
+ * Information saved between calls so we can determine the strategy
+ * point's advance rate and avoid scanning already-cleaned buffers.
+ *
+ * XXX One value per partition. We don't know how many partitions there
+ * are, so allocate 32 - that should be enough for the PoC patch.
+ *
+ * XXX might be better to have a per-partition struct with all the info
+ */
+#define MAX_CLOCKSWEEP_PARTITIONS 32
+static bool saved_info_valid = false;
+static int prev_strategy_buf_id[MAX_CLOCKSWEEP_PARTITIONS];
+static uint32 prev_strategy_passes[MAX_CLOCKSWEEP_PARTITIONS];
+static int next_to_clean[MAX_CLOCKSWEEP_PARTITIONS];
+static uint32 next_passes[MAX_CLOCKSWEEP_PARTITIONS];
+
+
/*
* BgBufferSync -- Write out some dirty buffers in the pool.
*
@@ -3602,55 +3619,24 @@ bool
BgBufferSync(WritebackContext *wb_context)
{
/* info obtained from freelist.c */
- int strategy_buf_id;
- uint32 strategy_passes;
uint32 recent_alloc;
+ uint32 recent_alloc_partition;
+ int num_partitions;
- /*
- * Information saved between calls so we can determine the strategy
- * point's advance rate and avoid scanning already-cleaned buffers.
- */
- static bool saved_info_valid = false;
- static int prev_strategy_buf_id;
- static uint32 prev_strategy_passes;
- static int next_to_clean;
- static uint32 next_passes;
-
- /* Moving averages of allocation rate and clean-buffer density */
- static float smoothed_alloc = 0;
- static float smoothed_density = 10.0;
-
- /* Potentially these could be tunables, but for now, not */
- float smoothing_samples = 16;
- float scan_whole_pool_milliseconds = 120000.0;
-
- /* Used to compute how far we scan ahead */
- long strategy_delta;
- int bufs_to_lap;
- int bufs_ahead;
- float scans_per_alloc;
- int reusable_buffers_est;
- int upcoming_alloc_est;
- int min_scan_buffers;
-
- /* Variables for the scanning loop proper */
- int num_to_scan;
- int num_written;
- int reusable_buffers;
+ /* assume we can hibernate; any partition can set this to false */
+ bool hibernate = true;
- /* Variables for final smoothed_density update */
- long new_strategy_delta;
- uint32 new_recent_alloc;
+ /* get the number of clocksweep partitions, and total alloc count */
+ StrategySyncPrepare(&num_partitions, &recent_alloc);
- /*
- * Find out where the freelist clock sweep currently is, and how many
- * buffer allocations have happened since our last call.
- */
- strategy_buf_id = StrategySyncStart(&strategy_passes, &recent_alloc);
+ Assert(num_partitions <= MAX_CLOCKSWEEP_PARTITIONS);
/* Report buffer alloc counts to pgstat */
PendingBgWriterStats.buf_alloc += recent_alloc;
+ /* average alloc buffers per partition */
+ recent_alloc_partition = (recent_alloc / num_partitions);
+
/*
* If we're not running the LRU scan, just stop after doing the stats
* stuff. We mark the saved state invalid so that we can recover sanely
@@ -3663,223 +3649,285 @@ BgBufferSync(WritebackContext *wb_context)
}
/*
- * Compute strategy_delta = how many buffers have been scanned by the
- * clock sweep since last time. If first time through, assume none. Then
- * see if we are still ahead of the clock sweep, and if so, how many
- * buffers we could scan before we'd catch up with it and "lap" it. Note:
- * weird-looking coding of xxx_passes comparisons are to avoid bogus
- * behavior when the passes counts wrap around.
- */
- if (saved_info_valid)
- {
- int32 passes_delta = strategy_passes - prev_strategy_passes;
-
- strategy_delta = strategy_buf_id - prev_strategy_buf_id;
- strategy_delta += (long) passes_delta * NBuffers;
+ * now process the clocksweep partitions, one by one, using the same
+ * cleanup that we used for all buffers
+ *
+ * XXX Maybe we should randomize the order of partitions a bit, so that we
+ * don't start from partition 0 all the time? Perhaps not entirely, but at
+ * least pick a random starting point?
+ */
+ for (int partition = 0; partition < num_partitions; partition++)
+ {
+ /* info obtained from freelist.c */
+ int strategy_buf_id;
+ uint32 strategy_passes;
+
+ /* Moving averages of allocation rate and clean-buffer density */
+ static float smoothed_alloc = 0;
+ static float smoothed_density = 10.0;
+
+ /* Potentially these could be tunables, but for now, not */
+ float smoothing_samples = 16;
+ float scan_whole_pool_milliseconds = 120000.0;
+
+ /* Used to compute how far we scan ahead */
+ long strategy_delta;
+ int bufs_to_lap;
+ int bufs_ahead;
+ float scans_per_alloc;
+ int reusable_buffers_est;
+ int upcoming_alloc_est;
+ int min_scan_buffers;
+
+ /* Variables for the scanning loop proper */
+ int num_to_scan;
+ int num_written;
+ int reusable_buffers;
+
+ /* Variables for final smoothed_density update */
+ long new_strategy_delta;
+ uint32 new_recent_alloc;
+
+ /* buffer range for the clocksweep partition */
+ int first_buffer;
+ int num_buffers;
- Assert(strategy_delta >= 0);
+ /*
+ * Find out where the freelist clock sweep currently is, and how many
+ * buffer allocations have happened since our last call.
+ */
+ strategy_buf_id = StrategySyncStart(partition, &strategy_passes,
+ &first_buffer, &num_buffers);
- if ((int32) (next_passes - strategy_passes) > 0)
+ /*
+ * Compute strategy_delta = how many buffers have been scanned by the
+ * clock sweep since last time. If first time through, assume none.
+ * Then see if we are still ahead of the clock sweep, and if so, how
+ * many buffers we could scan before we'd catch up with it and "lap"
+ * it. Note: weird-looking coding of xxx_passes comparisons are to
+ * avoid bogus behavior when the passes counts wrap around.
+ */
+ if (saved_info_valid)
{
- /* we're one pass ahead of the strategy point */
- bufs_to_lap = strategy_buf_id - next_to_clean;
+ int32 passes_delta = strategy_passes - prev_strategy_passes[partition];
+
+ strategy_delta = strategy_buf_id - prev_strategy_buf_id[partition];
+ strategy_delta += (long) passes_delta * num_buffers;
+
+ Assert(strategy_delta >= 0);
+
+ if ((int32) (next_passes[partition] - strategy_passes) > 0)
+ {
+ /* we're one pass ahead of the strategy point */
+ bufs_to_lap = strategy_buf_id - next_to_clean[partition];
#ifdef BGW_DEBUG
- elog(DEBUG2, "bgwriter ahead: bgw %u-%u strategy %u-%u delta=%ld lap=%d",
- next_passes, next_to_clean,
- strategy_passes, strategy_buf_id,
- strategy_delta, bufs_to_lap);
+ elog(DEBUG2, "bgwriter ahead: bgw %u-%u strategy %u-%u delta=%ld lap=%d",
+ next_passes[partition], next_to_clean[partition],
+ strategy_passes, strategy_buf_id,
+ strategy_delta, bufs_to_lap);
#endif
- }
- else if (next_passes == strategy_passes &&
- next_to_clean >= strategy_buf_id)
- {
- /* on same pass, but ahead or at least not behind */
- bufs_to_lap = NBuffers - (next_to_clean - strategy_buf_id);
+ }
+ else if (next_passes[partition] == strategy_passes &&
+ next_to_clean[partition] >= strategy_buf_id)
+ {
+ /* on same pass, but ahead or at least not behind */
+ bufs_to_lap = num_buffers - (next_to_clean[partition] - strategy_buf_id);
+#ifdef BGW_DEBUG
+ elog(DEBUG2, "bgwriter ahead: bgw %u-%u strategy %u-%u delta=%ld lap=%d",
+ next_passes[partition], next_to_clean[partition],
+ strategy_passes, strategy_buf_id,
+ strategy_delta, bufs_to_lap);
+#endif
+ }
+ else
+ {
+ /*
+ * We're behind, so skip forward to the strategy point and
+ * start cleaning from there.
+ */
#ifdef BGW_DEBUG
- elog(DEBUG2, "bgwriter ahead: bgw %u-%u strategy %u-%u delta=%ld lap=%d",
- next_passes, next_to_clean,
- strategy_passes, strategy_buf_id,
- strategy_delta, bufs_to_lap);
+ elog(DEBUG2, "bgwriter behind: bgw %u-%u strategy %u-%u delta=%ld",
+ next_passes[partition], next_to_clean[partition],
+ strategy_passes, strategy_buf_id,
+ strategy_delta);
#endif
+ next_to_clean[partition] = strategy_buf_id;
+ next_passes[partition] = strategy_passes;
+ bufs_to_lap = num_buffers;
+ }
}
else
{
/*
- * We're behind, so skip forward to the strategy point and start
- * cleaning from there.
+ * Initializing at startup or after LRU scanning had been off.
+ * Always start at the strategy point.
*/
#ifdef BGW_DEBUG
- elog(DEBUG2, "bgwriter behind: bgw %u-%u strategy %u-%u delta=%ld",
- next_passes, next_to_clean,
- strategy_passes, strategy_buf_id,
- strategy_delta);
+ elog(DEBUG2, "bgwriter initializing: strategy %u-%u",
+ strategy_passes, strategy_buf_id);
#endif
- next_to_clean = strategy_buf_id;
- next_passes = strategy_passes;
- bufs_to_lap = NBuffers;
+ strategy_delta = 0;
+ next_to_clean[partition] = strategy_buf_id;
+ next_passes[partition] = strategy_passes;
+ bufs_to_lap = num_buffers;
}
- }
- else
- {
- /*
- * Initializing at startup or after LRU scanning had been off. Always
- * start at the strategy point.
- */
-#ifdef BGW_DEBUG
- elog(DEBUG2, "bgwriter initializing: strategy %u-%u",
- strategy_passes, strategy_buf_id);
-#endif
- strategy_delta = 0;
- next_to_clean = strategy_buf_id;
- next_passes = strategy_passes;
- bufs_to_lap = NBuffers;
- }
- /* Update saved info for next time */
- prev_strategy_buf_id = strategy_buf_id;
- prev_strategy_passes = strategy_passes;
- saved_info_valid = true;
+ /* Update saved info for next time */
+ prev_strategy_buf_id[partition] = strategy_buf_id;
+ prev_strategy_passes[partition] = strategy_passes;
+ /* FIXME has to happen after all partitions */
+ /* saved_info_valid = true; */
- /*
- * Compute how many buffers had to be scanned for each new allocation, ie,
- * 1/density of reusable buffers, and track a moving average of that.
- *
- * If the strategy point didn't move, we don't update the density estimate
- */
- if (strategy_delta > 0 && recent_alloc > 0)
- {
- scans_per_alloc = (float) strategy_delta / (float) recent_alloc;
- smoothed_density += (scans_per_alloc - smoothed_density) /
- smoothing_samples;
- }
+ /*
+ * Compute how many buffers had to be scanned for each new allocation,
+ * ie, 1/density of reusable buffers, and track a moving average of
+ * that.
+ *
+ * If the strategy point didn't move, we don't update the density
+ * estimate
+ */
+ if (strategy_delta > 0 && recent_alloc_partition > 0)
+ {
+ scans_per_alloc = (float) strategy_delta / (float) recent_alloc_partition;
+ smoothed_density += (scans_per_alloc - smoothed_density) /
+ smoothing_samples;
+ }
- /*
- * Estimate how many reusable buffers there are between the current
- * strategy point and where we've scanned ahead to, based on the smoothed
- * density estimate.
- */
- bufs_ahead = NBuffers - bufs_to_lap;
- reusable_buffers_est = (float) bufs_ahead / smoothed_density;
+ /*
+ * Estimate how many reusable buffers there are between the current
+ * strategy point and where we've scanned ahead to, based on the
+ * smoothed density estimate.
+ */
+ bufs_ahead = num_buffers - bufs_to_lap;
+ reusable_buffers_est = (float) bufs_ahead / smoothed_density;
- /*
- * Track a moving average of recent buffer allocations. Here, rather than
- * a true average we want a fast-attack, slow-decline behavior: we
- * immediately follow any increase.
- */
- if (smoothed_alloc <= (float) recent_alloc)
- smoothed_alloc = recent_alloc;
- else
- smoothed_alloc += ((float) recent_alloc - smoothed_alloc) /
- smoothing_samples;
+ /*
+ * Track a moving average of recent buffer allocations. Here, rather
+ * than a true average we want a fast-attack, slow-decline behavior:
+ * we immediately follow any increase.
+ */
+ if (smoothed_alloc <= (float) recent_alloc_partition)
+ smoothed_alloc = recent_alloc_partition;
+ else
+ smoothed_alloc += ((float) recent_alloc_partition - smoothed_alloc) /
+ smoothing_samples;
- /* Scale the estimate by a GUC to allow more aggressive tuning. */
- upcoming_alloc_est = (int) (smoothed_alloc * bgwriter_lru_multiplier);
+ /* Scale the estimate by a GUC to allow more aggressive tuning. */
+ upcoming_alloc_est = (int) (smoothed_alloc * bgwriter_lru_multiplier);
- /*
- * If recent_alloc remains at zero for many cycles, smoothed_alloc will
- * eventually underflow to zero, and the underflows produce annoying
- * kernel warnings on some platforms. Once upcoming_alloc_est has gone to
- * zero, there's no point in tracking smaller and smaller values of
- * smoothed_alloc, so just reset it to exactly zero to avoid this
- * syndrome. It will pop back up as soon as recent_alloc increases.
- */
- if (upcoming_alloc_est == 0)
- smoothed_alloc = 0;
+ /*
+ * If recent_alloc remains at zero for many cycles, smoothed_alloc
+ * will eventually underflow to zero, and the underflows produce
+ * annoying kernel warnings on some platforms. Once
+ * upcoming_alloc_est has gone to zero, there's no point in tracking
+ * smaller and smaller values of smoothed_alloc, so just reset it to
+ * exactly zero to avoid this syndrome. It will pop back up as soon
+ * as recent_alloc increases.
+ */
+ if (upcoming_alloc_est == 0)
+ smoothed_alloc = 0;
- /*
- * Even in cases where there's been little or no buffer allocation
- * activity, we want to make a small amount of progress through the buffer
- * cache so that as many reusable buffers as possible are clean after an
- * idle period.
- *
- * (scan_whole_pool_milliseconds / BgWriterDelay) computes how many times
- * the BGW will be called during the scan_whole_pool time; slice the
- * buffer pool into that many sections.
- */
- min_scan_buffers = (int) (NBuffers / (scan_whole_pool_milliseconds / BgWriterDelay));
+ /*
+ * Even in cases where there's been little or no buffer allocation
+ * activity, we want to make a small amount of progress through the
+ * buffer cache so that as many reusable buffers as possible are clean
+ * after an idle period.
+ *
+ * (scan_whole_pool_milliseconds / BgWriterDelay) computes how many
+ * times the BGW will be called during the scan_whole_pool time; slice
+ * the buffer pool into that many sections.
+ */
+ min_scan_buffers = (int) (num_buffers / (scan_whole_pool_milliseconds / BgWriterDelay));
- if (upcoming_alloc_est < (min_scan_buffers + reusable_buffers_est))
- {
+ if (upcoming_alloc_est < (min_scan_buffers + reusable_buffers_est))
+ {
#ifdef BGW_DEBUG
- elog(DEBUG2, "bgwriter: alloc_est=%d too small, using min=%d + reusable_est=%d",
- upcoming_alloc_est, min_scan_buffers, reusable_buffers_est);
+ elog(DEBUG2, "bgwriter: alloc_est=%d too small, using min=%d + reusable_est=%d",
+ upcoming_alloc_est, min_scan_buffers, reusable_buffers_est);
#endif
- upcoming_alloc_est = min_scan_buffers + reusable_buffers_est;
- }
-
- /*
- * Now write out dirty reusable buffers, working forward from the
- * next_to_clean point, until we have lapped the strategy scan, or cleaned
- * enough buffers to match our estimate of the next cycle's allocation
- * requirements, or hit the bgwriter_lru_maxpages limit.
- */
+ upcoming_alloc_est = min_scan_buffers + reusable_buffers_est;
+ }
- num_to_scan = bufs_to_lap;
- num_written = 0;
- reusable_buffers = reusable_buffers_est;
+ /*
+ * Now write out dirty reusable buffers, working forward from the
+ * next_to_clean point, until we have lapped the strategy scan, or
+ * cleaned enough buffers to match our estimate of the next cycle's
+ * allocation requirements, or hit the bgwriter_lru_maxpages limit.
+ */
- /* Execute the LRU scan */
- while (num_to_scan > 0 && reusable_buffers < upcoming_alloc_est)
- {
- int sync_state = SyncOneBuffer(next_to_clean, true,
- wb_context);
+ num_to_scan = bufs_to_lap;
+ num_written = 0;
+ reusable_buffers = reusable_buffers_est;
- if (++next_to_clean >= NBuffers)
+ /* Execute the LRU scan */
+ while (num_to_scan > 0 && reusable_buffers < upcoming_alloc_est)
{
- next_to_clean = 0;
- next_passes++;
- }
- num_to_scan--;
+ int sync_state = SyncOneBuffer(next_to_clean[partition], true,
+ wb_context);
- if (sync_state & BUF_WRITTEN)
- {
- reusable_buffers++;
- if (++num_written >= bgwriter_lru_maxpages)
+ if (++next_to_clean[partition] >= (first_buffer + num_buffers))
{
- PendingBgWriterStats.maxwritten_clean++;
- break;
+ next_to_clean[partition] = first_buffer;
+ next_passes[partition]++;
+ }
+ num_to_scan--;
+
+ if (sync_state & BUF_WRITTEN)
+ {
+ reusable_buffers++;
+ if (++num_written >= (bgwriter_lru_maxpages / num_partitions))
+ {
+ PendingBgWriterStats.maxwritten_clean++;
+ break;
+ }
}
+ else if (sync_state & BUF_REUSABLE)
+ reusable_buffers++;
}
- else if (sync_state & BUF_REUSABLE)
- reusable_buffers++;
- }
- PendingBgWriterStats.buf_written_clean += num_written;
+ PendingBgWriterStats.buf_written_clean += num_written;
#ifdef BGW_DEBUG
- elog(DEBUG1, "bgwriter: recent_alloc=%u smoothed=%.2f delta=%ld ahead=%d density=%.2f reusable_est=%d upcoming_est=%d scanned=%d wrote=%d reusable=%d",
- recent_alloc, smoothed_alloc, strategy_delta, bufs_ahead,
- smoothed_density, reusable_buffers_est, upcoming_alloc_est,
- bufs_to_lap - num_to_scan,
- num_written,
- reusable_buffers - reusable_buffers_est);
+ elog(DEBUG1, "bgwriter: recent_alloc=%u smoothed=%.2f delta=%ld ahead=%d density=%.2f reusable_est=%d upcoming_est=%d scanned=%d wrote=%d reusable=%d",
+ recent_alloc_partition, smoothed_alloc, strategy_delta, bufs_ahead,
+ smoothed_density, reusable_buffers_est, upcoming_alloc_est,
+ bufs_to_lap - num_to_scan,
+ num_written,
+ reusable_buffers - reusable_buffers_est);
#endif
- /*
- * Consider the above scan as being like a new allocation scan.
- * Characterize its density and update the smoothed one based on it. This
- * effectively halves the moving average period in cases where both the
- * strategy and the background writer are doing some useful scanning,
- * which is helpful because a long memory isn't as desirable on the
- * density estimates.
- */
- new_strategy_delta = bufs_to_lap - num_to_scan;
- new_recent_alloc = reusable_buffers - reusable_buffers_est;
- if (new_strategy_delta > 0 && new_recent_alloc > 0)
- {
- scans_per_alloc = (float) new_strategy_delta / (float) new_recent_alloc;
- smoothed_density += (scans_per_alloc - smoothed_density) /
- smoothing_samples;
+ /*
+ * Consider the above scan as being like a new allocation scan.
+ * Characterize its density and update the smoothed one based on it.
+ * This effectively halves the moving average period in cases where
+ * both the strategy and the background writer are doing some useful
+ * scanning, which is helpful because a long memory isn't as desirable
+ * on the density estimates.
+ */
+ new_strategy_delta = bufs_to_lap - num_to_scan;
+ new_recent_alloc = reusable_buffers - reusable_buffers_est;
+ if (new_strategy_delta > 0 && new_recent_alloc > 0)
+ {
+ scans_per_alloc = (float) new_strategy_delta / (float) new_recent_alloc;
+ smoothed_density += (scans_per_alloc - smoothed_density) /
+ smoothing_samples;
#ifdef BGW_DEBUG
- elog(DEBUG2, "bgwriter: cleaner density alloc=%u scan=%ld density=%.2f new smoothed=%.2f",
- new_recent_alloc, new_strategy_delta,
- scans_per_alloc, smoothed_density);
+ elog(DEBUG2, "bgwriter: cleaner density alloc=%u scan=%ld density=%.2f new smoothed=%.2f",
+ new_recent_alloc, new_strategy_delta,
+ scans_per_alloc, smoothed_density);
#endif
+ }
+
+ /* hibernate if all partitions can hibernate */
+ hibernate &= (bufs_to_lap == 0 && recent_alloc_partition == 0);
}
+ /* now that we've scanned all partitions, mark the cached info as valid */
+ saved_info_valid = true;
+
/* Return true if OK to hibernate */
- return (bufs_to_lap == 0 && recent_alloc == 0);
+ return hibernate;
}
/*
diff --git a/src/backend/storage/buffer/freelist.c b/src/backend/storage/buffer/freelist.c
index c3fbd651dd5..ff02dc8e00b 100644
--- a/src/backend/storage/buffer/freelist.c
+++ b/src/backend/storage/buffer/freelist.c
@@ -63,17 +63,27 @@ typedef struct BufferStrategyFreelist
#define MIN_FREELIST_PARTITIONS 4
/*
- * The shared freelist control information.
+ * Information about one partition of the ClockSweep (on a subset of buffers).
+ *
+ * XXX Should be careful to align this to cachelines, etc.
*/
typedef struct
{
/* Spinlock: protects the values below */
- slock_t buffer_strategy_lock;
+ slock_t clock_sweep_lock;
+
+	/* range for this clock sweep partition */
+ int32 firstBuffer;
+ int32 numBuffers;
/*
* Clock sweep hand: index of next buffer to consider grabbing. Note that
* this isn't a concrete buffer - we only ever increase the value. So, to
* get an actual buffer, it needs to be used modulo NBuffers.
+ *
+ * XXX This is relative to firstBuffer, so needs to be offset properly.
+ *
+ * XXX firstBuffer + (nextVictimBuffer % numBuffers)
*/
pg_atomic_uint32 nextVictimBuffer;
@@ -83,6 +93,15 @@ typedef struct
*/
uint32 completePasses; /* Complete cycles of the clock sweep */
pg_atomic_uint32 numBufferAllocs; /* Buffers allocated since last reset */
+} ClockSweep;
+
+/*
+ * The shared freelist control information.
+ */
+typedef struct
+{
+ /* Spinlock: protects the values below */
+ slock_t buffer_strategy_lock;
/*
* Bgworker process to be notified upon activity or -1 if none. See
@@ -99,6 +118,9 @@ typedef struct
int num_partitions;
int num_partitions_per_node;
+ /* clocksweep partitions */
+ ClockSweep *sweeps;
+
BufferStrategyFreelist freelists[FLEXIBLE_ARRAY_MEMBER];
} BufferStrategyControl;
@@ -138,6 +160,7 @@ static BufferDesc *GetBufferFromRing(BufferAccessStrategy strategy,
uint32 *buf_state);
static void AddBufferToRing(BufferAccessStrategy strategy,
BufferDesc *buf);
+static ClockSweep *ChooseClockSweep(void);
/*
* ClockSweepTick - Helper routine for StrategyGetBuffer()
@@ -149,6 +172,7 @@ static inline uint32
ClockSweepTick(void)
{
uint32 victim;
+ ClockSweep *sweep = ChooseClockSweep();
/*
* Atomically move hand ahead one buffer - if there's several processes
@@ -156,14 +180,14 @@ ClockSweepTick(void)
* apparent order.
*/
victim =
- pg_atomic_fetch_add_u32(&StrategyControl->nextVictimBuffer, 1);
+ pg_atomic_fetch_add_u32(&sweep->nextVictimBuffer, 1);
- if (victim >= NBuffers)
+ if (victim >= sweep->numBuffers)
{
uint32 originalVictim = victim;
/* always wrap what we look up in BufferDescriptors */
- victim = victim % NBuffers;
+ victim = victim % sweep->numBuffers;
/*
* If we're the one that just caused a wraparound, force
@@ -189,19 +213,23 @@ ClockSweepTick(void)
* could lead to an overflow of nextVictimBuffers, but that's
* highly unlikely and wouldn't be particularly harmful.
*/
- SpinLockAcquire(&StrategyControl->buffer_strategy_lock);
+ SpinLockAcquire(&sweep->clock_sweep_lock);
- wrapped = expected % NBuffers;
+ wrapped = expected % sweep->numBuffers;
- success = pg_atomic_compare_exchange_u32(&StrategyControl->nextVictimBuffer,
+ success = pg_atomic_compare_exchange_u32(&sweep->nextVictimBuffer,
&expected, wrapped);
if (success)
- StrategyControl->completePasses++;
- SpinLockRelease(&StrategyControl->buffer_strategy_lock);
+ sweep->completePasses++;
+ SpinLockRelease(&sweep->clock_sweep_lock);
}
}
}
- return victim;
+
+ /* XXX buffer IDs are 1-based, we're calculating 0-based indexes */
+ Assert(BufferIsValid(1 + sweep->firstBuffer + (victim % sweep->numBuffers)));
+
+ return sweep->firstBuffer + victim;
}
static int
@@ -258,6 +286,28 @@ calculate_partition_index()
return index;
}
+/*
+ * ChooseClockSweep
+ * pick a clocksweep partition based on NUMA node and CPU
+ *
+ * The number of clocksweep partitions may not match the number of NUMA
+ * nodes, but it should not be lower. Each partition should be mapped to
+ * a single NUMA node, but a node may have multiple partitions. If there
+ * are multiple partitions per node (all nodes have the same number of
+ * partitions), we pick the partition using CPU.
+ *
+ * XXX Maybe we should make both the total and "per group" counts a power of
+ * two? That'd allow using shifts instead of divisions in the calculation,
+ * and that's cheaper. But how would that deal with an odd number of nodes?
+ */
+static ClockSweep *
+ChooseClockSweep(void)
+{
+ int index = calculate_partition_index();
+
+ return &StrategyControl->sweeps[index];
+}
+
/*
* ChooseFreeList
* Pick the buffer freelist to use, depending on the CPU and NUMA node.
@@ -374,7 +424,7 @@ StrategyGetBuffer(BufferAccessStrategy strategy, uint32 *buf_state, bool *from_r
* the rate of buffer consumption. Note that buffers recycled by a
* strategy object are intentionally not counted here.
*/
- pg_atomic_fetch_add_u32(&StrategyControl->numBufferAllocs, 1);
+ pg_atomic_fetch_add_u32(&ChooseClockSweep()->numBufferAllocs, 1);
/*
* First check, without acquiring the lock, whether there's buffers in the
@@ -445,13 +495,17 @@ StrategyGetBuffer(BufferAccessStrategy strategy, uint32 *buf_state, bool *from_r
/*
* Nothing on the freelist, so run the "clock sweep" algorithm
*
- * XXX Should we also make this NUMA-aware, to only access buffers from
- * the same NUMA node? That'd probably mean we need to make the clock
- * sweep NUMA-aware, perhaps by having multiple clock sweeps, each for a
- * subset of buffers. But that also means each process could "sweep" only
- * a fraction of buffers, even if the other buffers are better candidates
- * for eviction. Would that also mean we'd have multiple bgwriters, one
- * for each node, or would one bgwriter handle all of that?
+ * XXX Note that ClockSweepTick() is NUMA-aware, i.e. it only looks at
+ * buffers from a single partition, aligned with the NUMA node. That means
+ * it only accesses buffers from the same NUMA node.
+ *
+ * XXX That also means each process "sweeps" only a fraction of buffers,
+ * even if the other buffers are better candidates for eviction. Maybe
+ * there should be some logic to "steal" buffers from other freelists or
+ * other nodes?
+ *
+ * XXX Would that also mean we'd have multiple bgwriters, one for each
+ * node, or would one bgwriter handle all of that?
*/
trycounter = NBuffers;
for (;;)
@@ -533,6 +587,41 @@ StrategyFreeBuffer(BufferDesc *buf)
SpinLockRelease(&freelist->freelist_lock);
}
+/*
+ * StrategySyncPrepare -- prepare for sync of all partitions
+ *
+ * Returns the number of clocksweep partitions, and the count of recent
+ * buffer allocs summed over all the partitions. This allows BgBufferSync
+ * to calculate the average number of allocations per partition for the
+ * next sync cycle.
+ *
+ * The per-partition alloc counts are reset after being read, as the
+ * partitions are walked.
+ */
+void
+StrategySyncPrepare(int *num_parts, uint32 *num_buf_alloc)
+{
+ *num_buf_alloc = 0;
+ *num_parts = StrategyControl->num_partitions;
+
+ /*
+	 * We lock the partitions one by one, so not exactly in sync, but that
+ * should be fine. We're only looking for heuristics anyway.
+ */
+ for (int i = 0; i < StrategyControl->num_partitions; i++)
+ {
+ ClockSweep *sweep = &StrategyControl->sweeps[i];
+
+ SpinLockAcquire(&sweep->clock_sweep_lock);
+ if (num_buf_alloc)
+ {
+ *num_buf_alloc += pg_atomic_exchange_u32(&sweep->numBufferAllocs, 0);
+ }
+ SpinLockRelease(&sweep->clock_sweep_lock);
+ }
+}
+
/*
* StrategySyncStart -- tell BgBufferSync where to start syncing
*
@@ -540,37 +629,44 @@ StrategyFreeBuffer(BufferDesc *buf)
* BgBufferSync() will proceed circularly around the buffer array from there.
*
* In addition, we return the completed-pass count (which is effectively
- * the higher-order bits of nextVictimBuffer) and the count of recent buffer
- * allocs if non-NULL pointers are passed. The alloc count is reset after
- * being read.
+ * the higher-order bits of nextVictimBuffer).
+ *
+ * This only considers a single clocksweep partition, as BgBufferSync looks
+ * at them one by one.
*/
int
-StrategySyncStart(uint32 *complete_passes, uint32 *num_buf_alloc)
+StrategySyncStart(int partition, uint32 *complete_passes,
+ int *first_buffer, int *num_buffers)
{
uint32 nextVictimBuffer;
int result;
+ ClockSweep *sweep = &StrategyControl->sweeps[partition];
- SpinLockAcquire(&StrategyControl->buffer_strategy_lock);
- nextVictimBuffer = pg_atomic_read_u32(&StrategyControl->nextVictimBuffer);
- result = nextVictimBuffer % NBuffers;
+ Assert((partition >= 0) && (partition < StrategyControl->num_partitions));
+
+ SpinLockAcquire(&sweep->clock_sweep_lock);
+ nextVictimBuffer = pg_atomic_read_u32(&sweep->nextVictimBuffer);
+ result = nextVictimBuffer % sweep->numBuffers;
+
+ *first_buffer = sweep->firstBuffer;
+ *num_buffers = sweep->numBuffers;
if (complete_passes)
{
- *complete_passes = StrategyControl->completePasses;
+ *complete_passes = sweep->completePasses;
/*
* Additionally add the number of wraparounds that happened before
* completePasses could be incremented. C.f. ClockSweepTick().
*/
- *complete_passes += nextVictimBuffer / NBuffers;
+ *complete_passes += nextVictimBuffer / sweep->numBuffers;
}
+ SpinLockRelease(&sweep->clock_sweep_lock);
- if (num_buf_alloc)
- {
- *num_buf_alloc = pg_atomic_exchange_u32(&StrategyControl->numBufferAllocs, 0);
- }
- SpinLockRelease(&StrategyControl->buffer_strategy_lock);
- return result;
+ /* XXX buffer IDs start at 1, we're calculating 0-based indexes */
+ Assert(BufferIsValid(1 + sweep->firstBuffer + result));
+
+ return sweep->firstBuffer + result;
}
/*
@@ -658,6 +754,10 @@ StrategyShmemSize(void)
size = add_size(size, MAXALIGN(mul_size(sizeof(BufferStrategyFreelist),
num_partitions)));
+ /* size of clocksweep partitions (at least one per NUMA node) */
+ size = add_size(size, MAXALIGN(mul_size(sizeof(ClockSweep),
+ num_partitions)));
+
return size;
}
@@ -676,6 +776,7 @@ StrategyInitialize(bool init)
int num_nodes;
int num_partitions;
int num_partitions_per_node;
+ char *ptr;
/* */
BufferPartitionParams(&num_partitions, &num_nodes);
@@ -703,7 +804,8 @@ StrategyInitialize(bool init)
StrategyControl = (BufferStrategyControl *)
ShmemInitStruct("Buffer Strategy Status",
MAXALIGN(offsetof(BufferStrategyControl, freelists)) +
- MAXALIGN(sizeof(BufferStrategyFreelist) * num_partitions),
+ MAXALIGN(sizeof(BufferStrategyFreelist) * num_partitions) +
+ MAXALIGN(sizeof(ClockSweep) * num_partitions),
&found);
if (!found)
@@ -718,12 +820,41 @@ StrategyInitialize(bool init)
SpinLockInit(&StrategyControl->buffer_strategy_lock);
- /* Initialize the clock sweep pointer */
- pg_atomic_init_u32(&StrategyControl->nextVictimBuffer, 0);
+ /* have to point the sweeps array to right after the freelists */
+ ptr = (char *) StrategyControl +
+ MAXALIGN(offsetof(BufferStrategyControl, freelists)) +
+ MAXALIGN(sizeof(BufferStrategyFreelist) * num_partitions);
+ StrategyControl->sweeps = (ClockSweep *) ptr;
+
+ /* Initialize the clock sweep pointers (for all partitions) */
+ for (int i = 0; i < num_partitions; i++)
+ {
+ int node,
+ num_buffers,
+ first_buffer,
+ last_buffer;
+
+ SpinLockInit(&StrategyControl->sweeps[i].clock_sweep_lock);
+
+ pg_atomic_init_u32(&StrategyControl->sweeps[i].nextVictimBuffer, 0);
- /* Clear statistics */
- StrategyControl->completePasses = 0;
- pg_atomic_init_u32(&StrategyControl->numBufferAllocs, 0);
+ /* get info about the buffer partition */
+ BufferPartitionGet(i, &node, &num_buffers,
+ &first_buffer, &last_buffer);
+
+ /*
+		 * FIXME This may not be quite right, because if NBuffers is not a
+ * perfect multiple of numBuffers, the last partition will have
+ * numBuffers set too high. buf_init handles this by tracking the
+ * remaining number of buffers, and not overflowing.
+ */
+ StrategyControl->sweeps[i].numBuffers = num_buffers;
+ StrategyControl->sweeps[i].firstBuffer = first_buffer;
+
+ /* Clear statistics */
+ StrategyControl->sweeps[i].completePasses = 0;
+ pg_atomic_init_u32(&StrategyControl->sweeps[i].numBufferAllocs, 0);
+ }
/* No pending notification */
StrategyControl->bgwprocno = -1;
@@ -771,7 +902,6 @@ StrategyInitialize(bool init)
buf->freeNext = freelist->firstFreeBuffer;
freelist->firstFreeBuffer = i;
}
-
}
}
else
@@ -1111,9 +1241,11 @@ StrategyRejectBuffer(BufferAccessStrategy strategy, BufferDesc *buf, bool from_r
}
void
-FreelistPartitionGetInfo(int idx, uint64 *consumed, uint64 *remain, uint64 *actually_free)
+FreelistPartitionGetInfo(int idx, uint64 *consumed, uint64 *remain, uint64 *actually_free,
+ uint32 *complete_passes, uint32 *buffer_allocs, uint32 *next_victim_buffer)
{
BufferStrategyFreelist *freelist;
+ ClockSweep *sweep;
int cur;
/* stats */
@@ -1123,6 +1255,7 @@ FreelistPartitionGetInfo(int idx, uint64 *consumed, uint64 *remain, uint64 *actu
Assert((idx >= 0) && (idx < StrategyControl->num_partitions));
freelist = &StrategyControl->freelists[idx];
+ sweep = &StrategyControl->sweeps[idx];
/* stat */
SpinLockAcquire(&freelist->freelist_lock);
@@ -1152,4 +1285,11 @@ FreelistPartitionGetInfo(int idx, uint64 *consumed, uint64 *remain, uint64 *actu
*remain = cnt_remain;
*actually_free = cnt_free;
+
+ /* get the clocksweep stats too */
+ *complete_passes = sweep->completePasses;
+ *buffer_allocs = pg_atomic_read_u32(&sweep->numBufferAllocs);
+ *next_victim_buffer = pg_atomic_read_u32(&sweep->nextVictimBuffer);
+
+ *next_victim_buffer = sweep->firstBuffer + (*next_victim_buffer % sweep->numBuffers);
}
diff --git a/src/include/storage/buf_internals.h b/src/include/storage/buf_internals.h
index 9dfbecb9fe4..907b160b4f7 100644
--- a/src/include/storage/buf_internals.h
+++ b/src/include/storage/buf_internals.h
@@ -449,7 +449,9 @@ extern void StrategyFreeBuffer(BufferDesc *buf);
extern bool StrategyRejectBuffer(BufferAccessStrategy strategy,
BufferDesc *buf, bool from_ring);
-extern int StrategySyncStart(uint32 *complete_passes, uint32 *num_buf_alloc);
+extern void StrategySyncPrepare(int *num_parts, uint32 *num_buf_alloc);
+extern int StrategySyncStart(int partition, uint32 *complete_passes,
+ int *first_buffer, int *num_buffers);
extern void StrategyNotifyBgWriter(int bgwprocno);
extern Size StrategyShmemSize(void);
diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index df127274190..53855d4be23 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -349,7 +349,10 @@ extern int GetAccessStrategyPinLimit(BufferAccessStrategy strategy);
extern void FreeAccessStrategy(BufferAccessStrategy strategy);
extern void FreelistPartitionGetInfo(int idx,
uint64 *consumed, uint64 *remain,
- uint64 *actually_free);
+ uint64 *actually_free,
+ uint32 *complete_passes,
+ uint32 *buffer_allocs,
+ uint32 *next_victim_buffer);
/* inline functions */
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index c695cfa76e8..a38dd8d6242 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -428,6 +428,7 @@ ClientCertName
ClientConnectionInfo
ClientData
ClientSocket
+ClockSweep
ClonePtrType
ClosePortalStmt
ClosePtrType
--
2.50.1
Attachment: v3-0004-NUMA-partition-buffer-freelist.patch (text/x-patch)
From 06a43d54498ab5049c12a458cfbd4fe3b3b168c2 Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tomas@vondra.me>
Date: Thu, 22 May 2025 18:38:41 +0200
Subject: [PATCH v3 4/7] NUMA: partition buffer freelist
Instead of a single buffer freelist, partition it into multiple smaller
lists, to reduce lock contention, and to spread the buffers over all
NUMA nodes more evenly.
This uses the buffer partitioning scheme introduced by the earlier
patch, i.e. the partitions will "align" with NUMA nodes, etc.
It also extends the "pg_buffercache_partitions" view, to include
information about each freelist (number of consumed buffers, ...).
When allocating a buffer, it's taken from the correct freelist (same
NUMA node).
Note: This is (probably) more important than partitioning ProcArray.
---
.../pg_buffercache--1.6--1.7.sql | 2 +-
contrib/pg_buffercache/pg_buffercache_pages.c | 24 +-
src/backend/storage/buffer/freelist.c | 360 ++++++++++++++++--
src/backend/utils/init/globals.c | 1 +
src/backend/utils/misc/guc_tables.c | 10 +
src/include/miscadmin.h | 1 +
src/include/storage/bufmgr.h | 4 +-
7 files changed, 372 insertions(+), 30 deletions(-)
diff --git a/contrib/pg_buffercache/pg_buffercache--1.6--1.7.sql b/contrib/pg_buffercache/pg_buffercache--1.6--1.7.sql
index bd97246f6ab..3871c261528 100644
--- a/contrib/pg_buffercache/pg_buffercache--1.6--1.7.sql
+++ b/contrib/pg_buffercache/pg_buffercache--1.6--1.7.sql
@@ -12,7 +12,7 @@ LANGUAGE C PARALLEL SAFE;
-- Create a view for convenient access.
CREATE VIEW pg_buffercache_partitions AS
SELECT P.* FROM pg_buffercache_partitions() AS P
- (partition integer, numa_node integer, num_buffers integer, first_buffer integer, last_buffer integer);
+ (partition integer, numa_node integer, num_buffers integer, first_buffer integer, last_buffer integer, buffers_consumed bigint, buffers_remain bigint, buffers_free bigint);
-- Don't want these to be available to public.
REVOKE ALL ON FUNCTION pg_buffercache_partitions() FROM PUBLIC;
diff --git a/contrib/pg_buffercache/pg_buffercache_pages.c b/contrib/pg_buffercache/pg_buffercache_pages.c
index 8baa7c7b543..668ada8c47b 100644
--- a/contrib/pg_buffercache/pg_buffercache_pages.c
+++ b/contrib/pg_buffercache/pg_buffercache_pages.c
@@ -27,7 +27,7 @@
#define NUM_BUFFERCACHE_EVICT_ALL_ELEM 3
#define NUM_BUFFERCACHE_NUMA_ELEM 3
-#define NUM_BUFFERCACHE_PARTITIONS_ELEM 5
+#define NUM_BUFFERCACHE_PARTITIONS_ELEM 8
PG_MODULE_MAGIC_EXT(
.name = "pg_buffercache",
@@ -812,6 +812,12 @@ pg_buffercache_partitions(PG_FUNCTION_ARGS)
INT4OID, -1, 0);
TupleDescInitEntry(tupledesc, (AttrNumber) 5, "last_buffer",
INT4OID, -1, 0);
+ TupleDescInitEntry(tupledesc, (AttrNumber) 6, "buffers_consumed",
+ INT8OID, -1, 0);
+ TupleDescInitEntry(tupledesc, (AttrNumber) 7, "buffers_remain",
+ INT8OID, -1, 0);
+ TupleDescInitEntry(tupledesc, (AttrNumber) 8, "buffers_free",
+ INT8OID, -1, 0);
funcctx->user_fctx = BlessTupleDesc(tupledesc);
@@ -833,12 +839,19 @@ pg_buffercache_partitions(PG_FUNCTION_ARGS)
first_buffer,
last_buffer;
+ uint64 buffers_consumed,
+ buffers_remain,
+ buffers_free;
+
Datum values[NUM_BUFFERCACHE_PARTITIONS_ELEM];
bool nulls[NUM_BUFFERCACHE_PARTITIONS_ELEM];
BufferPartitionGet(i, &numa_node, &num_buffers,
&first_buffer, &last_buffer);
+ FreelistPartitionGetInfo(i, &buffers_consumed, &buffers_remain,
+ &buffers_free);
+
values[0] = Int32GetDatum(i);
nulls[0] = false;
@@ -854,6 +867,15 @@ pg_buffercache_partitions(PG_FUNCTION_ARGS)
values[4] = Int32GetDatum(last_buffer);
nulls[4] = false;
+ values[5] = Int64GetDatum(buffers_consumed);
+ nulls[5] = false;
+
+ values[6] = Int64GetDatum(buffers_remain);
+ nulls[6] = false;
+
+ values[7] = Int64GetDatum(buffers_free);
+ nulls[7] = false;
+
/* Build and return the tuple. */
tuple = heap_form_tuple((TupleDesc) funcctx->user_fctx, values, nulls);
result = HeapTupleGetDatum(tuple);
diff --git a/src/backend/storage/buffer/freelist.c b/src/backend/storage/buffer/freelist.c
index e046526c149..c3fbd651dd5 100644
--- a/src/backend/storage/buffer/freelist.c
+++ b/src/backend/storage/buffer/freelist.c
@@ -15,14 +15,52 @@
*/
#include "postgres.h"
+#include <sched.h>
+#include <sys/sysinfo.h>
+
+#ifdef USE_LIBNUMA
+#include <numa.h>
+#include <numaif.h>
+#endif
+
#include "pgstat.h"
#include "port/atomics.h"
#include "storage/buf_internals.h"
#include "storage/bufmgr.h"
+#include "storage/ipc.h"
#include "storage/proc.h"
#define INT_ACCESS_ONCE(var) ((int)(*((volatile int *)&(var))))
+/*
+ * Represents one freelist partition.
+ */
+typedef struct BufferStrategyFreelist
+{
+ /* Spinlock: protects the values below */
+ slock_t freelist_lock;
+
+ /*
+ * XXX Not sure why this needs to be aligned like this. Need to ask
+ * Andres.
+ */
+ int firstFreeBuffer __attribute__((aligned(64))); /* Head of list of
+ * unused buffers */
+
+ /* Number of buffers consumed from this list. */
+ uint64 consumed;
+} BufferStrategyFreelist;
+
+/*
+ * The minimum number of partitions we want to have, even on a non-NUMA
+ * system, as it helps with contention for buffers. With multiple NUMA nodes
+ * we want a separate partition per node, and with a low node count we may
+ * get multiple partitions per node.
+ *
+ * With multiple partitions per NUMA node, we pick the partition based on CPU
+ * (or some other parameter).
+ */
+#define MIN_FREELIST_PARTITIONS 4
/*
* The shared freelist control information.
@@ -39,8 +77,6 @@ typedef struct
*/
pg_atomic_uint32 nextVictimBuffer;
- int firstFreeBuffer; /* Head of list of unused buffers */
-
/*
* Statistics. These counters should be wide enough that they can't
* overflow during a single bgwriter cycle.
@@ -51,8 +87,19 @@ typedef struct
/*
* Bgworker process to be notified upon activity or -1 if none. See
* StrategyNotifyBgWriter.
+ *
+ * XXX Not sure why this needs to be aligned like this. Need to ask
+ * Andres. Also, shouldn't the alignment be specified after, like for
+ * "consumed"?
*/
- int bgwprocno;
+ int __attribute__((aligned(64))) bgwprocno;
+
+ /* info about freelist partitioning */
+ int num_nodes; /* effectively number of NUMA nodes */
+ int num_partitions;
+ int num_partitions_per_node;
+
+ BufferStrategyFreelist freelists[FLEXIBLE_ARRAY_MEMBER];
} BufferStrategyControl;
/* Pointers to shared state */
@@ -157,6 +204,88 @@ ClockSweepTick(void)
return victim;
}
+static int
+calculate_partition_index()
+{
+ int rc;
+ unsigned cpu;
+ unsigned node;
+ int index;
+
+ Assert(StrategyControl->num_partitions ==
+ (StrategyControl->num_nodes * StrategyControl->num_partitions_per_node));
+
+ /*
+ * freelist is partitioned, so determine the CPU/NUMA node, and pick a
+ * list based on that.
+ */
+ rc = getcpu(&cpu, &node);
+ if (rc != 0)
+ elog(ERROR, "getcpu failed: %m");
+
+ /*
+	 * XXX We shouldn't get nodes that we haven't considered while building the
+ * partitions. Maybe if we allow this (e.g. due to support adjusting the
+ * NUMA stuff at runtime), we should just do our best to minimize the
+ * conflicts somehow. But it'll make the mapping harder, so for now we
+ * ignore it.
+ */
+ if (node > StrategyControl->num_nodes)
+		elog(ERROR, "node out of range: %u > %d", node, StrategyControl->num_nodes);
+
+ /*
+ * Find the partition. If we have a single partition per node, we can
+ * calculate the index directly from node. Otherwise we need to do two
+ * steps, using node and then cpu.
+ */
+ if (StrategyControl->num_partitions_per_node == 1)
+ {
+ index = (node % StrategyControl->num_partitions);
+ }
+ else
+ {
+ int index_group,
+ index_part;
+
+ /* two steps - calculate group from node, partition from cpu */
+ index_group = (node % StrategyControl->num_nodes);
+ index_part = (cpu % StrategyControl->num_partitions_per_node);
+
+ index = (index_group * StrategyControl->num_partitions_per_node)
+ + index_part;
+ }
+
+ return index;
+}
+
+/*
+ * ChooseFreeList
+ * Pick the buffer freelist to use, depending on the CPU and NUMA node.
+ *
+ * Without partitioned freelists (numa_partition_freelist=false), there's only
+ * a single freelist, so use that.
+ *
+ * With partitioned freelists, we have multiple ways to pick the freelist
+ * for the backend:
+ *
+ * - one freelist per CPU, use the freelist for the CPU the task executes on
+ *
+ * - one freelist per NUMA node, use the freelist for the node the task
+ *   executes on
+ *
+ * - use a fixed number of freelists, map processes to lists based on PID
+ *
+ * There may be some other strategies, not sure. The important thing is this
+ * needs to be reflected during initialization, i.e. we need to create the
+ * right number of lists.
+ */
+static BufferStrategyFreelist *
+ChooseFreeList(void)
+{
+ int index = calculate_partition_index();
+
+ return &StrategyControl->freelists[index];
+}
+
/*
* have_free_buffer -- a lockless check to see if there is a free buffer in
* buffer pool.
@@ -168,10 +297,13 @@ ClockSweepTick(void)
bool
have_free_buffer(void)
{
- if (StrategyControl->firstFreeBuffer >= 0)
- return true;
- else
- return false;
+ for (int i = 0; i < StrategyControl->num_partitions; i++)
+ {
+ if (StrategyControl->freelists[i].firstFreeBuffer >= 0)
+ return true;
+ }
+
+ return false;
}
/*
@@ -193,6 +325,7 @@ StrategyGetBuffer(BufferAccessStrategy strategy, uint32 *buf_state, bool *from_r
int bgwprocno;
int trycounter;
uint32 local_buf_state; /* to avoid repeated (de-)referencing */
+ BufferStrategyFreelist *freelist;
*from_ring = false;
@@ -259,31 +392,35 @@ StrategyGetBuffer(BufferAccessStrategy strategy, uint32 *buf_state, bool *from_r
* buffer_strategy_lock not the individual buffer spinlocks, so it's OK to
* manipulate them without holding the spinlock.
*/
- if (StrategyControl->firstFreeBuffer >= 0)
+ freelist = ChooseFreeList();
+ if (freelist->firstFreeBuffer >= 0)
{
while (true)
{
/* Acquire the spinlock to remove element from the freelist */
- SpinLockAcquire(&StrategyControl->buffer_strategy_lock);
+ SpinLockAcquire(&freelist->freelist_lock);
- if (StrategyControl->firstFreeBuffer < 0)
+ if (freelist->firstFreeBuffer < 0)
{
- SpinLockRelease(&StrategyControl->buffer_strategy_lock);
+ SpinLockRelease(&freelist->freelist_lock);
break;
}
- buf = GetBufferDescriptor(StrategyControl->firstFreeBuffer);
+ buf = GetBufferDescriptor(freelist->firstFreeBuffer);
Assert(buf->freeNext != FREENEXT_NOT_IN_LIST);
/* Unconditionally remove buffer from freelist */
- StrategyControl->firstFreeBuffer = buf->freeNext;
+ freelist->firstFreeBuffer = buf->freeNext;
buf->freeNext = FREENEXT_NOT_IN_LIST;
+ /* increment number of buffers we consumed from this list */
+ freelist->consumed++;
+
/*
* Release the lock so someone else can access the freelist while
* we check out this buffer.
*/
- SpinLockRelease(&StrategyControl->buffer_strategy_lock);
+ SpinLockRelease(&freelist->freelist_lock);
/*
* If the buffer is pinned or has a nonzero usage_count, we cannot
@@ -305,7 +442,17 @@ StrategyGetBuffer(BufferAccessStrategy strategy, uint32 *buf_state, bool *from_r
}
}
- /* Nothing on the freelist, so run the "clock sweep" algorithm */
+ /*
+ * Nothing on the freelist, so run the "clock sweep" algorithm
+ *
+ * XXX Should we also make this NUMA-aware, to only access buffers from
+ * the same NUMA node? That'd probably mean we need to make the clock
+ * sweep NUMA-aware, perhaps by having multiple clock sweeps, each for a
+ * subset of buffers. But that also means each process could "sweep" only
+ * a fraction of buffers, even if the other buffers are better candidates
+ * for eviction. Would that also mean we'd have multiple bgwriters, one
+ * for each node, or would one bgwriter handle all of that?
+ */
trycounter = NBuffers;
for (;;)
{
@@ -356,7 +503,22 @@ StrategyGetBuffer(BufferAccessStrategy strategy, uint32 *buf_state, bool *from_r
void
StrategyFreeBuffer(BufferDesc *buf)
{
- SpinLockAcquire(&StrategyControl->buffer_strategy_lock);
+ BufferStrategyFreelist *freelist;
+
+ /*
+ * We don't want to call ChooseFreeList() again, because we might get a
+ * completely different freelist - either a different partition in the
+ * same group, or even a different group if the NUMA node changed. But we
+ * can calculate the proper freelist from the buffer id.
+ */
+ int index = (BufferGetNode(buf->buf_id) * StrategyControl->num_partitions_per_node)
+ + (buf->buf_id % StrategyControl->num_partitions_per_node);
+
+ Assert((index >= 0) && (index < StrategyControl->num_partitions));
+
+ freelist = &StrategyControl->freelists[index];
+
+ SpinLockAcquire(&freelist->freelist_lock);
/*
* It is possible that we are told to put something in the freelist that
@@ -364,11 +526,11 @@ StrategyFreeBuffer(BufferDesc *buf)
*/
if (buf->freeNext == FREENEXT_NOT_IN_LIST)
{
- buf->freeNext = StrategyControl->firstFreeBuffer;
- StrategyControl->firstFreeBuffer = buf->buf_id;
+ buf->freeNext = freelist->firstFreeBuffer;
+ freelist->firstFreeBuffer = buf->buf_id;
}
- SpinLockRelease(&StrategyControl->buffer_strategy_lock);
+ SpinLockRelease(&freelist->freelist_lock);
}
/*
@@ -432,6 +594,42 @@ StrategyNotifyBgWriter(int bgwprocno)
SpinLockRelease(&StrategyControl->buffer_strategy_lock);
}
+/* prints some debug info / stats about freelists at shutdown */
+static void
+freelist_before_shmem_exit(int code, Datum arg)
+{
+ for (int p = 0; p < StrategyControl->num_partitions; p++)
+ {
+ BufferStrategyFreelist *freelist = &StrategyControl->freelists[p];
+ uint64 remain = 0;
+ uint64 actually_free = 0;
+ int cur = freelist->firstFreeBuffer;
+
+ while (cur >= 0)
+ {
+ uint32 local_buf_state;
+ BufferDesc *buf;
+
+ buf = GetBufferDescriptor(cur);
+
+ remain++;
+
+ local_buf_state = LockBufHdr(buf);
+
+ if (!(local_buf_state & BM_TAG_VALID))
+ actually_free++;
+
+ UnlockBufHdr(buf, local_buf_state);
+
+ cur = buf->freeNext;
+ }
+ elog(LOG, "NUMA: freelist partition %d, firstF: %d: consumed: %lu, remain: %lu, actually free: %lu",
+ p,
+ freelist->firstFreeBuffer,
+ freelist->consumed,
+ remain, actually_free);
+ }
+}
/*
* StrategyShmemSize
@@ -445,12 +643,20 @@ Size
StrategyShmemSize(void)
{
Size size = 0;
+ int num_partitions;
+ int num_nodes;
+
+ BufferPartitionParams(&num_partitions, &num_nodes);
/* size of lookup hash table ... see comment in StrategyInitialize */
size = add_size(size, BufTableShmemSize(NBuffers + NUM_BUFFER_PARTITIONS));
/* size of the shared replacement strategy control block */
- size = add_size(size, MAXALIGN(sizeof(BufferStrategyControl)));
+ size = add_size(size, MAXALIGN(offsetof(BufferStrategyControl, freelists)));
+
+ /* size of freelist partitions (at least one per NUMA node) */
+ size = add_size(size, MAXALIGN(mul_size(sizeof(BufferStrategyFreelist),
+ num_partitions)));
return size;
}
@@ -467,6 +673,18 @@ StrategyInitialize(bool init)
{
bool found;
+ int num_nodes;
+ int num_partitions;
+ int num_partitions_per_node;
+
+ /* */
+ BufferPartitionParams(&num_partitions, &num_nodes);
+
+ /* always a multiple of NUMA nodes */
+ Assert(num_partitions % num_nodes == 0);
+
+ num_partitions_per_node = (num_partitions / num_nodes);
+
/*
* Initialize the shared buffer lookup hashtable.
*
@@ -484,7 +702,8 @@ StrategyInitialize(bool init)
*/
StrategyControl = (BufferStrategyControl *)
ShmemInitStruct("Buffer Strategy Status",
- sizeof(BufferStrategyControl),
+ MAXALIGN(offsetof(BufferStrategyControl, freelists)) +
+ MAXALIGN(sizeof(BufferStrategyFreelist) * num_partitions),
&found);
if (!found)
@@ -494,13 +713,10 @@ StrategyInitialize(bool init)
*/
Assert(init);
- SpinLockInit(&StrategyControl->buffer_strategy_lock);
+ /* register callback to dump some stats on exit */
+ before_shmem_exit(freelist_before_shmem_exit, 0);
- /*
- * Grab the whole linked list of free buffers for our strategy. We
- * assume it was previously set up by BufferManagerShmemInit().
- */
- StrategyControl->firstFreeBuffer = 0;
+ SpinLockInit(&StrategyControl->buffer_strategy_lock);
/* Initialize the clock sweep pointer */
pg_atomic_init_u32(&StrategyControl->nextVictimBuffer, 0);
@@ -511,6 +727,52 @@ StrategyInitialize(bool init)
/* No pending notification */
StrategyControl->bgwprocno = -1;
+
+ /* initialize the partitioned clocksweep */
+ StrategyControl->num_partitions = num_partitions;
+ StrategyControl->num_nodes = num_nodes;
+ StrategyControl->num_partitions_per_node = num_partitions_per_node;
+
+ /*
+ * Rebuild the freelist - right now all buffers are in one huge list,
+ * we want to rework that into multiple lists. Start by initializing
+ * the strategy to have empty lists.
+ */
+ for (int nfreelist = 0; nfreelist < num_partitions; nfreelist++)
+ {
+ int node,
+ num_buffers,
+ first_buffer,
+ last_buffer;
+
+ BufferStrategyFreelist *freelist;
+
+ freelist = &StrategyControl->freelists[nfreelist];
+
+ freelist->firstFreeBuffer = FREENEXT_END_OF_LIST;
+
+ SpinLockInit(&freelist->freelist_lock);
+
+ /* get info about the buffer partition */
+ BufferPartitionGet(nfreelist, &node,
+ &num_buffers, &first_buffer, &last_buffer);
+
+ /*
+ * Walk through buffers for each partition, add them to the list.
+ * Walk from the end, because we're adding the buffers to the
+ * beginning.
+ */
+
+ for (int i = last_buffer; i >= first_buffer; i--)
+ {
+ BufferDesc *buf = GetBufferDescriptor(i);
+
+ /* add to the freelist */
+ buf->freeNext = freelist->firstFreeBuffer;
+ freelist->firstFreeBuffer = i;
+ }
+
+ }
}
else
Assert(!init);
@@ -847,3 +1109,47 @@ StrategyRejectBuffer(BufferAccessStrategy strategy, BufferDesc *buf, bool from_r
return true;
}
+
+void
+FreelistPartitionGetInfo(int idx, uint64 *consumed, uint64 *remain, uint64 *actually_free)
+{
+ BufferStrategyFreelist *freelist;
+ int cur;
+
+ /* stats */
+ uint64 cnt_remain = 0;
+ uint64 cnt_free = 0;
+
+ Assert((idx >= 0) && (idx < StrategyControl->num_partitions));
+
+ freelist = &StrategyControl->freelists[idx];
+
+ /* stat */
+ SpinLockAcquire(&freelist->freelist_lock);
+
+ *consumed = freelist->consumed;
+
+ cur = freelist->firstFreeBuffer;
+ while (cur >= 0)
+ {
+ uint32 local_buf_state;
+ BufferDesc *buf;
+
+ buf = GetBufferDescriptor(cur);
+
+ cnt_remain++;
+
+ local_buf_state = LockBufHdr(buf);
+
+ if (!(local_buf_state & BM_TAG_VALID))
+ cnt_free++;
+
+ UnlockBufHdr(buf, local_buf_state);
+
+ cur = buf->freeNext;
+ }
+ SpinLockRelease(&freelist->freelist_lock);
+
+ *remain = cnt_remain;
+ *actually_free = cnt_free;
+}
diff --git a/src/backend/utils/init/globals.c b/src/backend/utils/init/globals.c
index f5359db3656..a11bc71a386 100644
--- a/src/backend/utils/init/globals.c
+++ b/src/backend/utils/init/globals.c
@@ -148,6 +148,7 @@ int MaxBackends = 0;
/* NUMA stuff */
bool numa_buffers_interleave = false;
bool numa_localalloc = false;
+bool numa_partition_freelist = false;
/* GUC parameters for vacuum */
int VacuumBufferUsageLimit = 2048;
diff --git a/src/backend/utils/misc/guc_tables.c b/src/backend/utils/misc/guc_tables.c
index a21f20800fb..0552ed62cc7 100644
--- a/src/backend/utils/misc/guc_tables.c
+++ b/src/backend/utils/misc/guc_tables.c
@@ -2136,6 +2136,16 @@ struct config_bool ConfigureNamesBool[] =
NULL, NULL, NULL
},
+ {
+ {"numa_partition_freelist", PGC_POSTMASTER, DEVELOPER_OPTIONS,
+ gettext_noop("Enables buffer freelists to be partitioned per NUMA node."),
+ gettext_noop("When enabled, we create a separate freelist per NUMA node."),
+ },
+ &numa_partition_freelist,
+ false,
+ NULL, NULL, NULL
+ },
+
{
{"sync_replication_slots", PGC_SIGHUP, REPLICATION_STANDBY,
gettext_noop("Enables a physical standby to synchronize logical failover replication slots from the primary server."),
diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h
index 692871a401f..66baf2bf33e 100644
--- a/src/include/miscadmin.h
+++ b/src/include/miscadmin.h
@@ -180,6 +180,7 @@ extern PGDLLIMPORT int max_parallel_workers;
extern PGDLLIMPORT bool numa_buffers_interleave;
extern PGDLLIMPORT bool numa_localalloc;
+extern PGDLLIMPORT bool numa_partition_freelist;
extern PGDLLIMPORT int commit_timestamp_buffers;
extern PGDLLIMPORT int multixact_member_buffers;
diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index deaf4f19fa4..df127274190 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -347,7 +347,9 @@ extern int GetAccessStrategyBufferCount(BufferAccessStrategy strategy);
extern int GetAccessStrategyPinLimit(BufferAccessStrategy strategy);
extern void FreeAccessStrategy(BufferAccessStrategy strategy);
-
+extern void FreelistPartitionGetInfo(int idx,
+ uint64 *consumed, uint64 *remain,
+ uint64 *actually_free);
/* inline functions */
--
2.50.1
Attachment: v3-0003-freelist-Don-t-track-tail-of-a-freelist.patch (text/x-patch)
From 3c7dbbec4ee3957c92bf647605768beb5473f66b Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Mon, 14 Oct 2024 14:10:13 -0400
Subject: [PATCH v3 3/7] freelist: Don't track tail of a freelist
The freelist tail isn't currently used, making it unnecessary overhead.
So just don't do that.
---
src/backend/storage/buffer/freelist.c | 9 ---------
1 file changed, 9 deletions(-)
diff --git a/src/backend/storage/buffer/freelist.c b/src/backend/storage/buffer/freelist.c
index 01909be0272..e046526c149 100644
--- a/src/backend/storage/buffer/freelist.c
+++ b/src/backend/storage/buffer/freelist.c
@@ -40,12 +40,6 @@ typedef struct
pg_atomic_uint32 nextVictimBuffer;
int firstFreeBuffer; /* Head of list of unused buffers */
- int lastFreeBuffer; /* Tail of list of unused buffers */
-
- /*
- * NOTE: lastFreeBuffer is undefined when firstFreeBuffer is -1 (that is,
- * when the list is empty)
- */
/*
* Statistics. These counters should be wide enough that they can't
@@ -371,8 +365,6 @@ StrategyFreeBuffer(BufferDesc *buf)
if (buf->freeNext == FREENEXT_NOT_IN_LIST)
{
buf->freeNext = StrategyControl->firstFreeBuffer;
- if (buf->freeNext < 0)
- StrategyControl->lastFreeBuffer = buf->buf_id;
StrategyControl->firstFreeBuffer = buf->buf_id;
}
@@ -509,7 +501,6 @@ StrategyInitialize(bool init)
* assume it was previously set up by BufferManagerShmemInit().
*/
StrategyControl->firstFreeBuffer = 0;
- StrategyControl->lastFreeBuffer = NBuffers - 1;
/* Initialize the clock sweep pointer */
pg_atomic_init_u32(&StrategyControl->nextVictimBuffer, 0);
--
2.50.1
Attachment: v3-0002-NUMA-localalloc.patch (text/x-patch)
From d08a94b0f54a4c4986d351f98269905bb511624c Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tomas@vondra.me>
Date: Thu, 22 May 2025 18:27:06 +0200
Subject: [PATCH v3 2/7] NUMA: localalloc
Set the default allocation policy to "localalloc", which means from the
local NUMA node. This is useful for process-private memory, which is not
going to be shared with other nodes, and is relatively short-lived (so
we're unlikely to have issues if the process gets moved by the scheduler).
This sets the default for the whole process, for all future allocations. But
that's fine, we've already populated the shared memory earlier (by
interleaving it explicitly). Otherwise we'd trigger a page fault and it'd
be allocated on the local node.
XXX This patch may not be necessary, as we now locate memory to nodes
using explicit numa_tonode_memory() calls, and not by interleaving. But
it's useful for experiments during development, so I'm keeping it.
---
src/backend/utils/init/globals.c | 1 +
src/backend/utils/init/miscinit.c | 17 +++++++++++++++++
src/backend/utils/misc/guc_tables.c | 10 ++++++++++
src/include/miscadmin.h | 1 +
4 files changed, 29 insertions(+)
diff --git a/src/backend/utils/init/globals.c b/src/backend/utils/init/globals.c
index 876cb64cf66..f5359db3656 100644
--- a/src/backend/utils/init/globals.c
+++ b/src/backend/utils/init/globals.c
@@ -147,6 +147,7 @@ int MaxBackends = 0;
/* NUMA stuff */
bool numa_buffers_interleave = false;
+bool numa_localalloc = false;
/* GUC parameters for vacuum */
int VacuumBufferUsageLimit = 2048;
diff --git a/src/backend/utils/init/miscinit.c b/src/backend/utils/init/miscinit.c
index 43b4dbccc3d..079974944e9 100644
--- a/src/backend/utils/init/miscinit.c
+++ b/src/backend/utils/init/miscinit.c
@@ -28,6 +28,10 @@
#include <arpa/inet.h>
#include <utime.h>
+#ifdef USE_LIBNUMA
+#include <numa.h>
+#endif
+
#include "access/htup_details.h"
#include "access/parallel.h"
#include "catalog/pg_authid.h"
@@ -164,6 +168,19 @@ InitPostmasterChild(void)
(errcode_for_socket_access(),
errmsg_internal("could not set postmaster death monitoring pipe to FD_CLOEXEC mode: %m")));
#endif
+
+#ifdef USE_LIBNUMA
+
+ /*
+ * Set the default allocation policy to local node, where the task is
+ * executing at the time of a page fault.
+ *
+ * XXX I believe this is not necessary, now that we don't use automatic
+ * interleaving (numa_set_interleave_mask).
+ */
+ if (numa_localalloc)
+ numa_set_localalloc();
+#endif
}
/*
diff --git a/src/backend/utils/misc/guc_tables.c b/src/backend/utils/misc/guc_tables.c
index 9570087aa60..a21f20800fb 100644
--- a/src/backend/utils/misc/guc_tables.c
+++ b/src/backend/utils/misc/guc_tables.c
@@ -2126,6 +2126,16 @@ struct config_bool ConfigureNamesBool[] =
NULL, NULL, NULL
},
+ {
+ {"numa_localalloc", PGC_POSTMASTER, DEVELOPER_OPTIONS,
+ gettext_noop("Enables setting the default allocation policy to local node."),
+ gettext_noop("When enabled, allocate from the node where the task is executing."),
+ },
+ &numa_localalloc,
+ false,
+ NULL, NULL, NULL
+ },
+
{
{"sync_replication_slots", PGC_SIGHUP, REPLICATION_STANDBY,
gettext_noop("Enables a physical standby to synchronize logical failover replication slots from the primary server."),
diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h
index 014a6079af2..692871a401f 100644
--- a/src/include/miscadmin.h
+++ b/src/include/miscadmin.h
@@ -179,6 +179,7 @@ extern PGDLLIMPORT int max_worker_processes;
extern PGDLLIMPORT int max_parallel_workers;
extern PGDLLIMPORT bool numa_buffers_interleave;
+extern PGDLLIMPORT bool numa_localalloc;
extern PGDLLIMPORT int commit_timestamp_buffers;
extern PGDLLIMPORT int multixact_member_buffers;
--
2.50.1
On Mon, Jul 28, 2025 at 4:22 PM Tomas Vondra <tomas@vondra.me> wrote:
Hi Tomas,
just a quick look here:
2) The PGPROC part introduces a similar registry, [..]
There's also a view pg_buffercache_pgproc. The pg_buffercache location
is a bit bogus - it has nothing to do with buffers, but it was good
enough for now.
If you are looking for better names: pg_shmem_pgproc_numa would sound
like a more natural name.
3) The PGPROC partitioning is reworked and should fix the crash with the
GUC set to "off".
Thanks!
simple benchmark
----------------
[..]
There's results for the three "pgbench pinning" strategies, and that can
have pretty significant impact (colocated generally performs much better
than either "none" or "random").
Hint: real world is that network cards are usually located on some PCI
slot that is assigned to certain node (so traffic is flowing from/to
there), so probably it would make some sense to put pgbench outside
this machine and remove this as "variable" anyway and remove the need
for that pgbench --pin-cpus in script. In optimal conditions: most
optimized layout would be probably to have 2 cards on separate PCI
slots, each for different node and some LACP between those, with
xmit_hash_policy allowing traffic distribution on both of those cards
-- usually there's not just single IP/MAC out there talking to/from
such server, so that would be real-world (or lack of) affinity.
Also, the classic pgbench workload seems to be a poor fit for testing it
out (at least v3-0001 buffers); there I would propose sticking to just
lots of big (~s_b size) full table seq scans to put stress on shared
memory. Classic pgbench is usually not enough to put serious bandwidth
on the interconnect, by my measurements.
For the "bigger" machine (wiuth 176 cores) the incremental results look
like this (for pinning=none, i.e. regular pgbench):mode s_b buffers localal no-tail freelist sweep pgproc pinning
====================================================================
prepared 16GB 99% 101% 100% 103% 111% 99% 102%
32GB 98% 102% 99% 103% 107% 101% 112%
8GB 97% 102% 100% 102% 101% 101% 106%
--------------------------------------------------------------------
simple 16GB 100% 100% 99% 105% 108% 99% 108%
32GB 98% 101% 100% 103% 100% 101% 97%
8GB 100% 100% 101% 99% 100% 104% 104%The way I read this is that the first three patches have about no impact
on throughput. Then freelist partitioning and (especially) clocksweep
partitioning can help quite a bit. pgproc is again close to ~0%, and
PGPROC pinning can help again (but this part is merely experimental).
Isn't the "pinning" column representing just numa_procs_pin=on ?
(shouldn't it be tested with numa_procs_interleave = on?)
[..]
To quantify this kind of improvement, I think we'll need tests that
intentionally cause (or try to) imbalance. If you have ideas for such
tests, let me know.
Some ideas:
1. concurrent seq scans hitting s_b-sized table
2. one single giant PX-enabled seq scan with $VCPU workers (stresses
the importance of interleaving dynamic shm for workers)
3. select txid_current() with -M prepared?
reserving number of huge pages
------------------------------
[..]
It took me ages to realize what's happening, but it's very simple. The
nr_hugepages is a global limit, but it's also translated into limits for
each NUMA node. So when you write 16828 to it, in a 4-node system each
node gets 1/4 of that. See

$ numastat -cm
Then we do the mmap(), and everything looks great, because there really
is enough huge pages and the system can allocate memory from any NUMA
node it needs.
Yup, similar story as with OOMs, just per-zone/node.
And then we come around, and do the numa_tonode_memory(). And that's
where the issues start, because AFAIK this does not check the per-node
limit of huge pages in any way. It just appears to work. And then later,
when we finally touch the buffer, it tries to actually allocate the
memory on the node, and realizes there's not enough huge pages. And
triggers the SIGBUS.
I think that's why options for strict policy numa allocation exist and
I had the option to use it in my patches (anyway with one big call to
numa_interleave_memory() for everything it was much simpler and
just not micromanaging things). Good reads are numa(3) but e.g.
mbind(2) underneath will tell you that e.g. `Before Linux 5.7.
MPOL_MF_STRICT was ignored on huge page mappings.` (I was on 6.14.x,
but it could be happening for you too if you start using it). Anyway,
numa_set_strict() is just a wrapper around setting this exact flag.
Anyway remember that volatile pg_numa_touch_mem_if_required()? - maybe
that should be always called in your patch series to pre-populate
everything during startup, so that others testing will get proper
guaranteed layout, even without issuing any pg_buffercache calls.
The only way around this I found is by inflating the number of huge
pages, significantly above the shared_memory_size_in_huge_pages value.
Just to make sure the nodes get enough huge pages.

I don't know what to do about this. It's quite annoying. If we only used
huge pages for the partitioned parts, this wouldn't be a problem.
Meh, sacrificing a couple of huge pages (worst-case 1GB ?) just to get
NUMA affinity, seems like a logical trade-off, doesn't it?
But postgres -C shared_memory_size_in_huge_pages still works OK to
establish the exact count for vm.nr_hugepages, right?
Regards,
-J.
On 7/30/25 10:29, Jakub Wartak wrote:
On Mon, Jul 28, 2025 at 4:22 PM Tomas Vondra <tomas@vondra.me> wrote:
Hi Tomas,
just a quick look here:
2) The PGPROC part introduces a similar registry, [..]
There's also a view pg_buffercache_pgproc. The pg_buffercache location
is a bit bogus - it has nothing to do with buffers, but it was good
enough for now.

If you are looking for better names: pg_shmem_pgproc_numa would sound
like a more natural name.

3) The PGPROC partitioning is reworked and should fix the crash with the
GUC set to "off".Thanks!
simple benchmark
----------------

[..]
There's results for the three "pgbench pinning" strategies, and that can
have pretty significant impact (colocated generally performs much better
than either "none" or "random").Hint: real world is that network cards are usually located on some PCI
slot that is assigned to certain node (so traffic is flowing from/to
there), so probably it would make some sense to put pgbench outside
this machine and remove this as "variable" anyway and remove the need
for that pgbench --pin-cpus in script. In optimal conditions: most
optimized layout would be probably to have 2 cards on separate PCI
slots, each for different node and some LACP between those, with
xmit_hash_policy allowing traffic distribution on both of those cards
-- usually there's not just single IP/MAC out there talking to/from
such server, so that would be real-world (or lack of) affinity.
The pgbench pinning certainly reduces some of the noise / overhead you
get when using multiple machines. I use it to "isolate" patches, and
make the effects more visible.
Also, the classic pgbench workload seems to be a poor fit for testing it
out (at least v3-0001 buffers); there I would propose sticking to just
lots of big (~s_b size) full table seq scans to put stress on shared
memory. Classic pgbench is usually not enough to put serious bandwidth
on the interconnect, by my measurements.
Yes, that's possible. The simple pgbench workload is a bit of a "worst
case" for the NUMA patches, in that it's can benefit less from the
improvements, and it's also fairly sensitive to regressions.
I plan to do more tests with other types of workloads, like the one
doing a lot of large sequential scans, etc.
For the "bigger" machine (wiuth 176 cores) the incremental results look
like this (for pinning=none, i.e. regular pgbench):mode s_b buffers localal no-tail freelist sweep pgproc pinning
====================================================================
prepared 16GB 99% 101% 100% 103% 111% 99% 102%
32GB 98% 102% 99% 103% 107% 101% 112%
8GB 97% 102% 100% 102% 101% 101% 106%
--------------------------------------------------------------------
simple 16GB 100% 100% 99% 105% 108% 99% 108%
32GB 98% 101% 100% 103% 100% 101% 97%
8GB 100% 100% 101% 99% 100% 104% 104%The way I read this is that the first three patches have about no impact
on throughput. Then freelist partitioning and (especially) clocksweep
partitioning can help quite a bit. pgproc is again close to ~0%, and
PGPROC pinning can help again (but this part is merely experimental).

Isn't the "pinning" column representing just numa_procs_pin=on ?
(shouldn't it be tested with numa_procs_interleave = on?)
Maybe I don't understand the question, but the last column (pinning)
compares two builds.
1) Build with all the patches up to "pgproc interleaving" (and all of
the GUCs set to "on").
2) Build with all the patches from (1), and "pinning" too (again, all
GUCs set to "on).
Or do I misunderstand the question?
[..]
To quantify this kind of improvement, I think we'll need tests that
intentionally cause (or try to) imbalance. If you have ideas for such
tests, let me know.

Some ideas:
1. concurrent seq scans hitting s_b-sized table
2. one single giant PX-enabled seq scan with $VCPU workers (stresses
the importance of interleaving dynamic shm for workers)
3. select txid_current() with -M prepared?
Thanks. I think we'll try something like (1), but it'll need to be a bit
more elaborate, because scans on tables larger than 1/4 shared buffers
use a small circular buffer.
reserving number of huge pages
------------------------------

[..]
It took me ages to realize what's happening, but it's very simple. The
nr_hugepages is a global limit, but it's also translated into limits for
each NUMA node. So when you write 16828 to it, in a 4-node system each
node gets 1/4 of that. See

$ numastat -cm
Then we do the mmap(), and everything looks great, because there really
is enough huge pages and the system can allocate memory from any NUMA
node it needs.

Yup, similar story as with OOMs, just per-zone/node.
And then we come around, and do the numa_tonode_memory(). And that's
where the issues start, because AFAIK this does not check the per-node
limit of huge pages in any way. It just appears to work. And then later,
when we finally touch the buffer, it tries to actually allocate the
memory on the node, and realizes there's not enough huge pages. And
triggers the SIGBUS.

I think that's why options for strict policy numa allocation exist and
I had the option to use it in my patches (anyway with one big call to
numa_interleave_memory() for everything it was much simpler and
just not micromanaging things). Good reads are numa(3) but e.g.
mbind(2) underneath will tell you that e.g. `Before Linux 5.7.
MPOL_MF_STRICT was ignored on huge page mappings.` (I was on 6.14.x,
but it could be happening for you too if you start using it). Anyway,
numa_set_strict() is just a wrapper around setting this exact flag.

Anyway remember that volatile pg_numa_touch_mem_if_required()? - maybe
that should be always called in your patch series to pre-populate
everything during startup, so that others testing will get proper
guaranteed layout, even without issuing any pg_buffercache calls.
I think I tried using numa_set_strict, but it didn't change the behavior
(i.e. the numa_tonode_memory didn't error out).
The only way around this I found is by inflating the number of huge
pages, significantly above the shared_memory_size_in_huge_pages value.
Just to make sure the nodes get enough huge pages.

I don't know what to do about this. It's quite annoying. If we only used
huge pages for the partitioned parts, this wouldn't be a problem.

Meh, sacrificing a couple of huge pages (worst-case 1GB ?) just to get
NUMA affinity, seems like a logical trade-off, doesn't it?
But postgres -C shared_memory_size_in_huge_pages still works OK to
establish the exact count for vm.nr_hugepages, right?
Well, yes and no. It tells you the exact number of huge pages, but it
does not tell you how much you need to inflate it to account for the
non-shared buffer part that may get allocated on a random node.
regards
--
Tomas Vondra
Hi,
Here's an updated version of the patch series. The main improvement is
the new 0006 patch, adding "adaptive balancing" of allocations. I'll
also share some results from a workload doing a lot of allocations.
adaptive balancing of allocations
---------------------------------
Imagine each backend only allocates buffers from the partition on the
same NUMA node. E.g. you have 4 NUMA nodes (i.e. 4 partitions), and a
backend only allocates buffers from its "home" partition (on the same NUMA
node). This is what the earlier patch versions did, and with many
backends that's mostly fine (assuming the backends get spread over all
the NUMA nodes).
But if there are only a few backends doing the allocations, this can result
in very inefficient use of shared buffers - a single backend would be
limited to 25% of buffers, even if the rest is unused.
There needs to be some way to "redirect" excess allocations to other
partitions, so that the partitions are utilized about the same. This is
what the 0006 patch aims to do (I kept it separate, but it should
probably get merged into the "clocksweep partitioning" in the end).
The balancing is fairly simple:
(1) It tracks the number of allocations "requested" from each partition.
(2) At regular intervals (from bgwriter) calculate the "fair share" per
partition, and determine what fraction of "requests" to handle from the
partition itself, and how many to redirect to other partitions.
(3) Calculate coefficients to drive this for each partition.
I emphasize that (1) counts "requests", not the actual allocations. Some
of the requests may have been redirected to a different partition, and
counted as allocations there. We want to balance the allocations, but
the calculation has to work from the requests.
To give you a simple example - imagine there are 2 partitions with this
number of allocation requests:
P1: 900,000 requests
P2: 100,000 requests
This means the "fair share" is 500,000 allocations, so P1 needs to
redirect some requests to P2. And we end up with these weights:
P1: [ 55, 45]
P2: [ 0, 100]
Assuming the workload does not shift in some dramatic way, this should
result in both partitions handling ~500k allocations.
It's not hard to extend this algorithm to more partitions. For more
details see StrategySyncBalance(), which recalculates this.
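To make the weight calculation a bit more concrete, here's a rough
stand-alone sketch of the idea described above (simplified, not the
actual StrategySyncBalance() code): over-loaded partitions keep their
fair share locally and spread the excess over the under-loaded
partitions in proportion to how much room those have. Running it with
the 900k/100k example above reproduces the weights shown (modulo
integer rounding).

/* Sketch of the redirection-weight calculation (not the patch's code). */
#include <stdio.h>
#include <stdint.h>

#define NPARTS 2

static void
compute_weights(const uint64_t requests[NPARTS], int weights[NPARTS][NPARTS])
{
	uint64_t	total = 0;
	uint64_t	fair;
	uint64_t	deficit_total = 0;

	for (int i = 0; i < NPARTS; i++)
		total += requests[i];
	fair = total / NPARTS;

	/* total "room" available in under-loaded partitions */
	for (int j = 0; j < NPARTS; j++)
		if (requests[j] < fair)
			deficit_total += fair - requests[j];

	for (int i = 0; i < NPARTS; i++)
	{
		uint64_t	excess;

		if (requests[i] <= fair || deficit_total == 0)
		{
			/* under-loaded (or perfectly balanced): keep everything local */
			for (int j = 0; j < NPARTS; j++)
				weights[i][j] = (i == j) ? 100 : 0;
			continue;
		}

		/*
		 * over-loaded: keep the fair share locally, spread the excess in
		 * proportion to how much room each under-loaded partition has
		 */
		excess = requests[i] - fair;

		for (int j = 0; j < NPARTS; j++)
		{
			if (j == i)
				weights[i][j] = (int) (100 * fair / requests[i]);
			else if (requests[j] < fair)
				weights[i][j] = (int) (100 * excess * (fair - requests[j])
									   / (deficit_total * requests[i]));
			else
				weights[i][j] = 0;
		}
	}
}

int
main(void)
{
	uint64_t	requests[NPARTS] = {900000, 100000};
	int			weights[NPARTS][NPARTS];

	compute_weights(requests, weights);

	/* prints roughly P1: [55, 44], P2: [0, 100] (integer rounding) */
	for (int i = 0; i < NPARTS; i++)
		printf("P%d: [%3d, %3d]\n", i + 1, weights[i][0], weights[i][1]);
	return 0;
}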
There are a couple open questions, like:
* The algorithm combines the old/new weights by averaging, to add a bit
of hysteresis. Right now it's a simple average with 0.5 weight, to
dampen sudden changes (see the small sketch after this list). I think it
works fine (in the long run), but I'm open to suggestions on how to do
this better.
* There are probably additional things we should consider when deciding
where to redirect the allocations. For example, we may have multiple
partitions per NUMA node, in which case it's better to keep as many of
the redirected allocations as possible on the same node. The current
patch ignores this.
* The partitions may have slightly different sizes, but the balancing
ignores that for now. This is not very difficult to address.
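For reference, the smoothing from the first point above is essentially
this (a sketch of the idea, not the exact code from the patch):

/*
 * Blend the freshly computed weight with the previous one using a fixed
 * 0.5 factor, so one interval with an unusual request pattern doesn't
 * swing the redirection too hard.
 */
static inline double
smooth_weight(double old_weight, double new_weight)
{
	return 0.5 * old_weight + 0.5 * new_weight;
}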
clocksweep benchmark
--------------------
I ran a simple benchmark focused on allocation-heavy workloads, namely
large concurrent sequential scans. The attached scripts generate a
number of 1GB tables, and then run concurrent sequential scans with
shared buffers set to 60%, 75%, 90% and 110% of the total dataset size.
I did this for master, and with the NUMA patches applied (and the GUCs
set to 'on'). I also tried with the number of partitions increased to
16 (so each NUMA node got multiple partitions).
There are results from three machines:
1) ryzen - small non-NUMA system, mostly to see if there's regressions
2) xeon - older 2-node NUMA system
3) hb176 - big EPYC system with 176 cores / 4 NUMA nodes
The script records detailed TPS stats (e.g. percentiles), I'm attaching
CSV files with complete results, and some PDFs with charts summarizing
that (I'll get to that in a minute).
For the EPYC, the average tps for the three builds looks like this:
  clients |  master    numa   numa-16  |    numa   numa-16
          |         (average tps)      |       (vs master)
----------|----------------------------|-------------------
        8 |      20      27        26  |    133%      129%
       16 |      23      39        45  |    170%      193%
       24 |      23      48        58  |    211%      252%
       32 |      21      57        68  |    268%      321%
       40 |      21      56        76  |    265%      363%
       48 |      22      59        82  |    270%      375%
       56 |      22      66        88  |    296%      397%
       64 |      23      62        93  |    277%      411%
       72 |      24      68        95  |    277%      389%
       80 |      24      72        95  |    295%      391%
       88 |      25      71        98  |    283%      392%
       96 |      26      74        97  |    282%      369%
      104 |      26      74        97  |    282%      367%
      112 |      27      77        95  |    287%      355%
      120 |      28      77        92  |    279%      335%
      128 |      27      75        89  |    277%      328%
That's not bad - the clocksweep partitioning increases the throughput
2-3x. Having 16 partitions (instead of 4) helps a bit more still, up to
3-4x.
This is for shared buffers set to 60% of the dataset, whose size depends
on the number of clients / tables. With 64 clients/tables, there's 64GB of
data, and shared buffers are set to ~39GB.
The results for 75% and 90% follow the same pattern. For 110% there's
much less impact - the data fits into shared buffers, so there are
(almost) no allocations, and any gains have to be thanks to the other
NUMA patches.
The charts in the attached PDFs add a bit more detail, with various
percentiles (of per-second throughput). The bands are roughly quartiles:
5-25%, 25-50%, 50-75%, 75-95%. The thick middle line is the median.
There are only charts for 60%, 90% and 110% shared buffers, to fit it
all on a single page. The 75% case is not very different.
For ryzen there's little difference. Not surprising, it's not a NUMA
system. So this is a positive result, as there's no regression.
For xeon the patches help a little bit. Again, not surprising. It's a
fairly old system (~2016), and the differences between NUMA nodes are
not that significant.
For epyc (hb176), the differences are pretty massive.
regards
--
Tomas Vondra
Attachments:
numa-benchmark-ryzen.pdf (application/pdf)