NUMA shared memory interleaving

Started by Jakub Wartak, 9 months ago, 15 messages

jakub.wartak@enterprisedb.com

9 months ago

1 attachment(s)

Thanks to having pg_numa.c, we can now simply address problem #2 of
NUMA imbalance from [1] (pages 11-14) by interleaving shm memory in
PG19 - patch attached. We do not need to call numa_set_localalloc() as
we only interleave shm segments, while local allocations stay the same
(well, "local" means relative to the CPU asking for private memory).
Below is the result from a legacy 4s32t64 Sandy Bridge EP box with low NUMA
(QPI) interconnect bandwidth to better illustrate the problem (it's
a bit of an edge case, but someone may hit it):

Testcase:
a small s_b (here it was 4GB*) that fully fits in a single NUMA node's
hugepage zone, as this was tested with huge_pages=on

$ cat seqconcurrscans.pgb
\set num (:client_id % 8) + 1
select sum(octet_length(filler)) from pgbench_accounts_:num;

/usr/local/pgsql/bin/pg_ctl -D /db/data -l logfile restart
/usr/local/pgsql/bin/psql -c "select
pg_prewarm('pgbench_accounts_'||s) from generate_series(1, 8) s;"
#load all using current policy
/usr/local/pgsql/bin/psql -c "select * from
pg_shmem_allocations_numa where name = 'Buffer Blocks';"
/usr/local/pgsql/bin/pgbench -c 64 -j 8 -P 1 -T 60 -f seqconcurrscans.pgb

on master and numa=off (default) and in previous versions:
name | numa_node | size
---------------+-----------+------------
Buffer Blocks | 0 | 0
Buffer Blocks | 1 | 0
Buffer Blocks | 2 | 4297064448
Buffer Blocks | 3 | 0

latency average = 1826.324 ms
latency stddev = 665.567 ms
tps = 34.708151 (without initial connection time)

on master and numa=on:
name | numa_node | size
---------------+-----------+------------
Buffer Blocks | 0 | 1073741824
Buffer Blocks | 1 | 1073741824
Buffer Blocks | 2 | 1075838976
Buffer Blocks | 3 | 1073741824

latency average = 1002.288 ms
latency stddev = 214.392 ms
tps = 63.344814 (without initial connection time)
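
The near-even split in the numa=on output is what page-granular round-robin placement would produce. Below is a minimal sketch of that arithmetic, assuming 2MB huge pages (the post only says huge pages were on) and assuming placement starts on node 2 (chosen only to match the output above). `interleave_bytes_per_node` is a hypothetical helper for illustration, not part of the patch or of libnuma.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper (not from the patch): split total_bytes over nnodes
 * NUMA nodes in page-sized round-robin order, the way an interleave policy
 * places consecutive pages. first_node is the node that receives page 0
 * (an assumption of this sketch). Per-node byte counts are written to out[].
 */
void
interleave_bytes_per_node(uint64_t total_bytes, uint64_t page_bytes,
                          int nnodes, int first_node, uint64_t *out)
{
    uint64_t npages;
    uint64_t base;
    uint64_t extra;

    assert(nnodes > 0 && page_bytes > 0);
    assert(first_node >= 0 && first_node < nnodes);

    npages = (total_bytes + page_bytes - 1) / page_bytes;  /* round up */
    base = npages / (uint64_t) nnodes;   /* pages every node receives */
    extra = npages % (uint64_t) nnodes;  /* leftovers, handed out from first_node */

    for (int i = 0; i < nnodes; i++)
    {
        /* position of node i in the round-robin order */
        uint64_t pos = (uint64_t) ((i - first_node + nnodes) % nnodes);

        out[i] = (base + (pos < extra ? 1 : 0)) * page_bytes;
    }
}
```

Called with the 4297064448-byte "Buffer Blocks" size from the numa=off output, a 2097152-byte page, 4 nodes and first_node = 2, this yields 2049 pages: 512 on nodes 0, 1 and 3, and 513 on node 2.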

Normal pgbench workloads tend to be not affected, as each backend
tends to touch just a small partition of shm (thanks to BAS
strategies). Some remaining questions are:
1. How to name this GUC (numa or numa_shm_interleave) ? I prefer the
first option, as we could potentially in future add more optimizations
behind that GUC.
2. Should we also interleave DSA/DSM for Parallel Query? (I'm not an
expert on DSA/DSM at all)
3. Should we fail to start if numa=on is set on an unsupported platform?

* An interesting tidbit for getting reliable measurements: one needs to
double check that s_b (the hugepage allocation) is smaller than the
per-NUMA-zone free hugepages (i.e. s_b fits the static hugepage allocation
within a single zone). This shouldn't be a problem on 2 sockets (as most of
the time there s_b is < 50% of RAM anyway; well, usually 26-30% with some
stuff sized by max_connections, so it's higher than 25%, but people usually
sysctl nr_hugepages=25% of RAM), but with >= 4 NUMA nodes (4 sockets or some
modern MCMs) the kernel might start spilling the s_b (> 25%) to another
NUMA node on its own, so it's best to verify it using
pg_shmem_allocations_numa...
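
The check described in this footnote can be sketched as a small pure function. This is only an illustration: `first_node_that_fits` is a hypothetical name, and the per-node free counts are expected to be supplied by the caller (on Linux they could be read from files such as /sys/devices/system/node/node0/hugepages/hugepages-2048kB/free_hugepages).

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper (not from the patch): return the index of the first
 * NUMA node whose free huge pages can hold the whole segment, or -1 if no
 * single node can, in which case the kernel would have to spill the segment
 * across nodes.
 */
int
first_node_that_fits(uint64_t segment_bytes, uint64_t hugepage_bytes,
                     const uint64_t *free_pages, int nnodes)
{
    uint64_t needed;

    assert(hugepage_bytes > 0 && nnodes >= 0);

    /* huge pages needed to back the segment, rounded up */
    needed = (segment_bytes + hugepage_bytes - 1) / hugepage_bytes;

    for (int i = 0; i < nnodes; i++)
    {
        if (free_pages[i] >= needed)
            return i;
    }
    return -1;
}
```

A result of -1 means the "everything on one node" baseline cannot be reproduced reliably, because the segment will not fit in any single zone.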

-J.

[1]: https://anarazel.de/talks/2024-10-23-pgconf-eu-numa-vs-postgresql/numa-vs-postgresql.pdf

Attachments:

v1-0001-Add-capability-to-interleave-shared-memory-across.patch (application/octet-stream)

From f7295a1c4cd07c393ced70ad0c8622efdb6ab26d Mon Sep 17 00:00:00 2001
From: Jakub Wartak <jakub.wartak@enterprisedb.com>
Date: Wed, 16 Apr 2025 10:23:31 +0200
Subject: [PATCH v1] Add capability to interleave shared memory across multiple
 NUMA nodes

Introduce new GUC numa=off(default)/on/auto that might be used to
enable interleaving shared memory. Until today, imbalances in shared memory
allocations on NUMA setups, may have caused non-deterministic performance
due to differences in latencies and bandwidths across interconnects ("remote"
access). This is only supported on Linux with libnuma.

Author: Jakub Wartak <jakub.wartak@enterprisedb.com>
Inspired-by: Andres Freund <andres@anarazel.de>
Reviewed-by:
Discussion:
---
 doc/src/sgml/config.sgml                      | 16 ++++++++++++++++
 src/backend/port/sysv_shmem.c                 |  7 +++++++
 src/backend/utils/misc/guc_tables.c           | 18 ++++++++++++++++++
 src/backend/utils/misc/postgresql.conf.sample |  2 ++
 src/include/port/pg_numa.h                    |  1 +
 src/include/storage/pg_shmem.h                | 10 ++++++++++
 src/port/pg_numa.c                            | 17 +++++++++++++++++
 7 files changed, 71 insertions(+)

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index c1674c22cb2..15397df71d6 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -2313,6 +2313,22 @@ include_dir 'conf.d'
       </listitem>
      </varlistentry>
 
+     <varlistentry id="guc-numa" xreflabel="numa">
+      <term><varname>numa</varname> (<type>enum</type>)
+      <indexterm>
+       <primary><varname>numa</varname> configuration parameter</primary>
+      </indexterm>
+      </term>
+      <listitem>
+       <para>
+        Specifies whether to use NUMA interleaving policy for the shared memory
+        segment. Possible values are <literal>off</literal>, <literal>on</literal>
+        and <literal>auto</literal>. This parameter is only effective on Linux.
+        The default value is <literal>off</literal>. This parameter can only be set
+        at server start.
+       </para>
+      </listitem>
+     </varlistentry>
      </variablelist>
      </sect2>
 
diff --git a/src/backend/port/sysv_shmem.c b/src/backend/port/sysv_shmem.c
index 197926d44f6..510b0e53638 100644
--- a/src/backend/port/sysv_shmem.c
+++ b/src/backend/port/sysv_shmem.c
@@ -29,6 +29,7 @@
 
 #include "miscadmin.h"
 #include "port/pg_bitutils.h"
+#include "port/pg_numa.h"
 #include "portability/mem.h"
 #include "storage/dsm.h"
 #include "storage/fd.h"
@@ -663,6 +664,12 @@ CreateAnonymousSegment(Size *size)
 						 allocsize) : 0));
 	}
 
+	if (numa == NUMA_ON || (numa == NUMA_AUTO && pg_numa_get_max_node() > 1))
+	{
+		elog(DEBUG1, "enabling NUMA shm interleaving");
+		pg_numa_interleave_memptr(ptr, allocsize);
+	}
+
 	*size = allocsize;
 	return ptr;
 }
diff --git a/src/backend/utils/misc/guc_tables.c b/src/backend/utils/misc/guc_tables.c
index 60b12446a1c..e4c9491df78 100644
--- a/src/backend/utils/misc/guc_tables.c
+++ b/src/backend/utils/misc/guc_tables.c
@@ -491,6 +491,13 @@ static const struct config_enum_entry file_copy_method_options[] = {
 	{NULL, 0, false}
 };
 
+static const struct config_enum_entry numa_options[] = {
+	{"off", NUMA_OFF, false},
+	{"on", NUMA_ON, false},
+	{"auto", NUMA_AUTO, false},
+	{NULL, 0, false}
+};
+
 /*
  * Options for enum values stored in other modules
  */
@@ -579,6 +586,7 @@ static int	ssl_renegotiation_limit;
 int			huge_pages = HUGE_PAGES_TRY;
 int			huge_page_size;
 int			huge_pages_status = HUGE_PAGES_UNKNOWN;
+int			numa = DEFAULT_NUMA;
 
 /*
  * These variables are all dummies that don't do anything, except in some
@@ -5418,6 +5426,16 @@ struct config_enum ConfigureNamesEnum[] =
 		NULL, assign_io_method, NULL
 	},
 
+	{
+		{"numa", PGC_POSTMASTER, RESOURCES_MEM,
+			gettext_noop("Whether to use NUMA interleaving for shared memory."),
+			NULL
+		},
+		&numa,
+		DEFAULT_NUMA, numa_options,
+		NULL, NULL, NULL
+	},
+
 	/* End-of-list marker */
 	{
 		{NULL, 0, 0, NULL, NULL}, NULL, 0, NULL, NULL, NULL, NULL
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index 34826d01380..f46b2d8de4d 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -137,6 +137,8 @@
 					# (change requires restart)
 #huge_page_size = 0			# zero for system default
 					# (change requires restart)
+#numa = off				# off, on, auto
+					# (change requires restart)
 #temp_buffers = 8MB			# min 800kB
 #max_prepared_transactions = 0		# zero disables the feature
 					# (change requires restart)
diff --git a/src/include/port/pg_numa.h b/src/include/port/pg_numa.h
index 40f1d324dcf..129663de2e8 100644
--- a/src/include/port/pg_numa.h
+++ b/src/include/port/pg_numa.h
@@ -17,6 +17,7 @@
 extern PGDLLIMPORT int pg_numa_init(void);
 extern PGDLLIMPORT int pg_numa_query_pages(int pid, unsigned long count, void **pages, int *status);
 extern PGDLLIMPORT int pg_numa_get_max_node(void);
+extern PGDLLIMPORT int pg_numa_interleave_memptr(void *ptr, size_t sz);
 
 #ifdef USE_LIBNUMA
 
diff --git a/src/include/storage/pg_shmem.h b/src/include/storage/pg_shmem.h
index 5f7d4b83a60..1b09f7ec390 100644
--- a/src/include/storage/pg_shmem.h
+++ b/src/include/storage/pg_shmem.h
@@ -46,6 +46,7 @@ extern PGDLLIMPORT int shared_memory_type;
 extern PGDLLIMPORT int huge_pages;
 extern PGDLLIMPORT int huge_page_size;
 extern PGDLLIMPORT int huge_pages_status;
+extern PGDLLIMPORT int numa;
 
 /* Possible values for huge_pages and huge_pages_status */
 typedef enum
@@ -64,6 +65,15 @@ typedef enum
 	SHMEM_TYPE_MMAP,
 }			PGShmemType;
 
+typedef enum
+{
+	NUMA_OFF,
+	NUMA_ON,
+	NUMA_AUTO,
+}			NumaType;
+
+#define DEFAULT_NUMA NUMA_OFF
+
 #ifndef WIN32
 extern PGDLLIMPORT unsigned long UsedShmemSegID;
 #else
diff --git a/src/port/pg_numa.c b/src/port/pg_numa.c
index 4b487a2a4e8..6ed0a5d2949 100644
--- a/src/port/pg_numa.c
+++ b/src/port/pg_numa.c
@@ -55,6 +55,17 @@ pg_numa_get_max_node(void)
 	return numa_max_node();
 }
 
+int
+pg_numa_interleave_memptr(void *ptr, size_t sz)
+{
+	struct bitmask *nodemask = numa_allocate_nodemask();
+
+	numa_bitmask_setall(nodemask);
+	numa_interleave_memory(ptr, sz, nodemask);
+	numa_free_nodemask(nodemask);
+	return 0;
+}
+
 #else
 
 /* Empty wrappers */
@@ -77,4 +88,10 @@ pg_numa_get_max_node(void)
 	return 0;
 }
 
+int
+pg_numa_interleave_memptr(void *ptr, size_t sz)
+{
+	return 0;
+}
+
 #endif
-- 
2.39.5

Thomas Munro

thomas.munro@gmail.com

9 months ago

In reply to: Jakub Wartak (#1)

Re: NUMA shared memory interleaving

On Wed, Apr 16, 2025 at 9:14 PM Jakub Wartak
<jakub.wartak@enterprisedb.com> wrote:

2. Should we also interleave DSA/DSM for Parallel Query? (I'm not an
expert on DSA/DSM at all)

I have no answers but I have speculated for years about a very
specific case (without any idea where to begin due to lack of ... I
guess all this sort of stuff): in ExecParallelHashJoinNewBatch(),
workers split up and try to work on different batches on their own to
minimise contention, and when that's not possible (more workers than
batches, or finishing their existing work at different times and going
to help others), they just proceed in round-robin order. A beginner
thought is: if you're going to help someone working on a hash table,
it would surely be best to have the CPUs and all the data on the same
NUMA node. During loading, cache line ping pong would be cheaper, and
during probing, it *might* be easier to tune explicit memory prefetch
timing that way as it would look more like a single node system with a
fixed latency, IDK (I've shared patches for prefetching before that
showed pretty decent speedups, and the lack of that feature is
probably a bigger problem than any of this stuff, who knows...).
Another beginner thought is that the DSA allocator is a source of
contention during loading: the dumbest problem is that the chunks are
just too small, but it might also be interesting to look into per-node
pools. Or something. IDK, just some thoughts...

Thomas Munro

thomas.munro@gmail.com

9 months ago

In reply to: Thomas Munro (#2)

Re: NUMA shared memory interleaving

On Thu, Apr 17, 2025 at 1:58 AM Thomas Munro <thomas.munro@gmail.com> wrote:

I have no answers but I have speculated for years about a very
specific case (without any idea where to begin due to lack of ... I
guess all this sort of stuff): in ExecParallelHashJoinNewBatch(),
workers split up and try to work on different batches on their own to
minimise contention, and when that's not possible (more workers than
batches, or finishing their existing work at different times and going
to help others), they just proceed in round-robin order. A beginner
thought is: if you're going to help someone working on a hash table,
it would surely be best to have the CPUs and all the data on the same
NUMA node. During loading, cache line ping pong would be cheaper, and
during probing, it *might* be easier to tune explicit memory prefetch
timing that way as it would look more like a single node system with a
fixed latency, IDK (I've shared patches for prefetching before that
showed pretty decent speedups, and the lack of that feature is
probably a bigger problem than any of this stuff, who knows...).
Another beginner thought is that the DSA allocator is a source of
contention during loading: the dumbest problem is that the chunks are
just too small, but it might also be interesting to look into per-node
pools. Or something. IDK, just some thoughts...

And BTW there are papers about that (but they mostly just remind me
that I have to reboot the prefetching patch long before that...), for
example:

https://15721.courses.cs.cmu.edu/spring2023/papers/11-hashjoins/lang-imdm2013.pdf

Robert Haas

robertmhaas@gmail.com

9 months ago

In reply to: Jakub Wartak (#1)

Re: NUMA shared memory interleaving

On Wed, Apr 16, 2025 at 5:14 AM Jakub Wartak
<jakub.wartak@enterprisedb.com> wrote:

Normal pgbench workloads tend to be not affected, as each backend
tends to touch just a small partition of shm (thanks to BAS
strategies). Some remaining questions are:
1. How to name this GUC (numa or numa_shm_interleave) ? I prefer the
first option, as we could potentially in future add more optimizations
behind that GUC.

I wonder whether the GUC needs to support interleaving between a
designated set of nodes rather than only being able to do all nodes.
For example, suppose someone is pinning the processes to a certain set
of NUMA nodes; perhaps then they wouldn't want to use memory from
other nodes.

--
Robert Haas
EDB: http://www.enterprisedb.com

Bertrand Drouvot

bertranddrouvot.pg@gmail.com

9 months ago

In reply to: Robert Haas (#4)

Re: NUMA shared memory interleaving

Hi,

On Wed, Apr 16, 2025 at 10:05:04AM -0400, Robert Haas wrote:

On Wed, Apr 16, 2025 at 5:14 AM Jakub Wartak
<jakub.wartak@enterprisedb.com> wrote:

Normal pgbench workloads tend to be not affected, as each backend
tends to touch just a small partition of shm (thanks to BAS
strategies). Some remaining questions are:
1. How to name this GUC (numa or numa_shm_interleave) ? I prefer the
first option, as we could potentially in future add more optimizations
behind that GUC.

I wonder whether the GUC needs to support interleaving between a
designated set of nodes rather than only being able to do all nodes.
For example, suppose someone is pinning the processes to a certain set
of NUMA nodes; perhaps then they wouldn't want to use memory from
other nodes.

+1. That could be used for instances consolidation on the same host. One could
ensure that numa nodes are not shared across instances (cpu and memory resource
isolation per instance). Bonus point, adding Direct I/O into the game would
ensure that the OS page cache is not shared too.

Regards,

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

Bertrand Drouvot

bertranddrouvot.pg@gmail.com

9 months ago

In reply to: Thomas Munro (#2)

Re: NUMA shared memory interleaving

Hi,

On Thu, Apr 17, 2025 at 01:58:44AM +1200, Thomas Munro wrote:

On Wed, Apr 16, 2025 at 9:14 PM Jakub Wartak
<jakub.wartak@enterprisedb.com> wrote:

2. Should we also interleave DSA/DSM for Parallel Query? (I'm not an
expert on DSA/DSM at all)

I have no answers but I have speculated for years about a very
specific case (without any idea where to begin due to lack of ... I
guess all this sort of stuff): in ExecParallelHashJoinNewBatch(),
workers split up and try to work on different batches on their own to
minimise contention, and when that's not possible (more workers than
batches, or finishing their existing work at different times and going
to help others), they just proceed in round-robin order. A beginner
thought is: if you're going to help someone working on a hash table,
it would surely be best to have the CPUs and all the data on the same
NUMA node. During loading, cache line ping pong would be cheaper, and
during probing, it *might* be easier to tune explicit memory prefetch
timing that way as it would look more like a single node system with a
fixed latency, IDK (I've shared patches for prefetching before that
showed pretty decent speedups, and the lack of that feature is
probably a bigger problem than any of this stuff, who knows...).
Another beginner thought is that the DSA allocator is a source of
contention during loading: the dumbest problem is that the chunks are
just too small, but it might also be interesting to look into per-node
pools. Or something. IDK, just some thoughts...

I'm also thinking that could be beneficial for parallel workers. I think the
ideal scenario would be to have the parallel workers spread across numa nodes and
accessing their "local" memory first (and help with "remote" memory access if
there is still more work to do "remotely").

Regards,

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

Jakub Wartak

jakub.wartak@enterprisedb.com

7 months ago

In reply to: Bertrand Drouvot (#5)

1 attachment(s)

Re: NUMA shared memory interleaving

On Fri, Apr 18, 2025 at 7:43 PM Bertrand Drouvot
<bertranddrouvot.pg@gmail.com> wrote:

Hi,

On Wed, Apr 16, 2025 at 10:05:04AM -0400, Robert Haas wrote:

On Wed, Apr 16, 2025 at 5:14 AM Jakub Wartak
<jakub.wartak@enterprisedb.com> wrote:

Normal pgbench workloads tend to be not affected, as each backend
tends to touch just a small partition of shm (thanks to BAS
strategies). Some remaining questions are:
1. How to name this GUC (numa or numa_shm_interleave) ? I prefer the
first option, as we could potentially in future add more optimizations
behind that GUC.

I wonder whether the GUC needs to support interleaving between a
designated set of nodes rather than only being able to do all nodes.
For example, suppose someone is pinning the processes to a certain set
of NUMA nodes; perhaps then they wouldn't want to use memory from
other nodes.

+1. That could be used for instances consolidation on the same host. One could
ensure that numa nodes are not shared across instances (cpu and memory resource
isolation per instance). Bonus point, adding Direct I/O into the game would
ensure that the OS page cache is not shared too.

Hi, the attached patch has two changes:
1. It adds more modes and supports this 'consolidation' and
'isolation' scenario from above. Doc in patch briefly explains the
merit.
2. It adds trivial NUMA interleaving for PQ (Parallel Query)

The original initial test expanded on the very same machine
(4s32c128t, QPI interconnect):

numa='off'
latency average = 1271.019 ms
latency stddev = 245.061 ms
tps = 49.683923 (without initial connection time)
explanation(pcm-memory): 3 sockets doing ~46MB/s on RAM (almost
idle), 1 socket doing ~17GB/s (fully saturated because in this
scenario s_b ended up on only one NUMA node)

numa='all'
latency average = 702.622 ms
latency stddev = 13.259 ms
tps = 90.026526 (without initial connection time)
explanation(pcm-memory): this forced to interleave s_b across 4
NUMA nodes and each socket gets equal part of workload (9.2 - 10GB/s)
totalling ~37GB/s of memory bandwidth

This gives a boost: 90/49.6=1.8x. The values for memory bandwidth are
combined read+write.

NUMA impact on the Parallel Query:
----------------------------------
Setup: the most simplistic interleaving of s_b, with
dynamic_shared_memory for PQ interleaved too:
max_worker_processes=max_parallel_workers=max_parallel_workers_per_gather=64
alter on 1 partition to force real 64 parallel seq scans
The query:
select sum(octet_length(filler)) from pgbench_accounts;
launched 64 effective parallel workers for 64 partitions of 400MB each
(25600MB in total). All of that fit in s_b (32GB), so everything was
fetched from s_b. All of it was hot; the first several runs are not shown.

select sum(octet_length(filler)) from pgbench_accounts;

numa='off'
Time: 1108.178 ms (00:01.108)
Time: 1118.494 ms (00:01.118)
Time: 1104.491 ms (00:01.104)
Time: 1112.221 ms (00:01.112)
Time: 1105.501 ms (00:01.106)
avg: 1109 ms

-- not interleaved, more like appended:
postgres=# select * from pg_shmem_allocations_numa where name =
'Buffer Blocks';
name | numa_node | size
---------------+-----------+------------
Buffer Blocks | 0 | 9277800448
Buffer Blocks | 1 | 7044333568
Buffer Blocks | 2 | 9097445376
Buffer Blocks | 3 | 8942256128

numa='all'
Time: 1026.747 ms (00:01.027)
Time: 1024.087 ms (00:01.024)
Time: 1024.179 ms (00:01.024)
Time: 1037.026 ms (00:01.037)
avg: 1027 ms

postgres=# select * from pg_shmem_allocations_numa where name
= 'Buffer Blocks';
name | numa_node | size
---------------+-----------+------------
Buffer Blocks | 0 | 8589934592
Buffer Blocks | 1 | 8592031744
Buffer Blocks | 2 | 8589934592
Buffer Blocks | 3 | 8589934592

1109/1027=1.079x, not bad for such a trivial change, and the paper
referenced by Thomas also stated (`We can see an improvement by a
factor of more than three by just running
the non-NUMA-aware implementation on interleaved memory`). It probably
could be improved much further, but I'm not planning to work on this
more. So if anything:
- latency-wise: it would be best to place leader+all PQ workers close
to s_b, provided s_b fits NUMA shared/huge page memory there and you
won't need more CPU than there is on that NUMA node... (assuming e.g.
hosting 4 DBs on 4 sockets, each on its own, it would be best to pin
everything, including shm, but PQ workers too)
- capacity/TPS-wise or s_b > NUMA: just interleave to maximize
bandwidth and get uniform CPU performance out of this

The patch supports e.g. numa='@1' which should fully isolate the
workload to just memory and CPUs on node #1.
As for the patch: I'm lost with our C headers policy :)
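
To make the value syntax above concrete, here is a minimal, self-contained sketch of how a setting such as numa='@1' or numa='=0,2-3' could be decoded. It is not the patch's check_numa(): the enum and function names are hypothetical, matching is case-sensitive for brevity, and the node set is kept in a plain 64-bit mask rather than a libnuma bitmask.

```c
#include <assert.h>
#include <ctype.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical modes for this sketch; the patch defines its own enum. */
typedef enum
{
    SK_OFF, SK_AUTO, SK_ALL,
    SK_PREFERRED,   /* plain node list: interleave, spilling allowed */
    SK_STRICT,      /* '=' prefix: memory strictly bound to the nodes */
    SK_STRICT_CPU,  /* '@' prefix: memory and CPUs bound to the nodes */
    SK_INVALID
} SketchNumaMode;

/* Parse "0,2-3" into a bitmask of nodes 0..63; 0 on success, -1 on error. */
int
parse_node_list(const char *s, uint64_t *mask)
{
    *mask = 0;
    if (*s == '\0')
        return -1;
    while (*s != '\0')
    {
        char *end;
        long  lo, hi;

        if (!isdigit((unsigned char) *s))
            return -1;
        lo = strtol(s, &end, 10);
        hi = lo;
        if (*end == '-')            /* a range such as "2-3" */
        {
            s = end + 1;
            if (!isdigit((unsigned char) *s))
                return -1;
            hi = strtol(s, &end, 10);
        }
        if (lo > hi || hi > 63)
            return -1;
        for (long n = lo; n <= hi; n++)
            *mask |= UINT64_C(1) << n;
        if (*end == ',')
        {
            s = end + 1;
            if (*s == '\0')         /* trailing comma */
                return -1;
        }
        else if (*end == '\0')
            s = end;
        else
            return -1;
    }
    return 0;
}

/* Decode a whole setting value into a mode plus node mask. */
SketchNumaMode
parse_numa_value(const char *val, uint64_t *mask)
{
    SketchNumaMode mode = SK_PREFERRED;

    *mask = 0;
    if (strcmp(val, "off") == 0)
        return SK_OFF;
    if (strcmp(val, "auto") == 0)
        return SK_AUTO;
    if (strcmp(val, "all") == 0)
        return SK_ALL;
    if (*val == '=')
    {
        mode = SK_STRICT;
        val++;
    }
    else if (*val == '@')
    {
        mode = SK_STRICT_CPU;
        val++;
    }
    if (parse_node_list(val, mask) != 0)
    {
        *mask = 0;
        return SK_INVALID;
    }
    return mode;
}
```

With this sketch, "@1" decodes to the strict-plus-CPU mode with only node 1 set, while a bare "0-1" decodes to the preferred (non-strict) mode for nodes 0 and 1.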

One of the less obvious reasons (besides better efficiency when
consolidating multiple PostgreSQL clusters on a single NUMA server)
why I've implemented '=' and '@' is that it seems that CXL memory can be
attached as a CPU-less(!) NUMA node, thus Linux - depending on
sysctls/sysfs setup - could use it for automatic memory tiering, and
the above provides a configurable way to prevent allocation on such
(potential) systems - simply exclude such a NUMA node via config for now
and we are covered, I think. I have no access to real hardware, so I
haven't researched it further, but it looks like in the far future we
could probably indicate preferred NUMA memory nodes (think big s_b,
bigger than "CPU" RAM), and the kernel could transparently do NUMA
auto balancing/demotion for us (AKA Transparent Page Placement AKA
memory) or vice versa: use a small s_b, do not use the CXL node at all,
and expect that the VFS cache could be spilled there.
numa_weighted_interleave_memory() / MPOL_WEIGHTED_INTERLEAVE is not
yet supported in distros (although new libnuma has support for it), so
I have not included it in the patch, as it was too early.

BTW: DO NOT USE meson's --buildtype=debug as it somehow negates the
benefit of the NUMA optimizations; I've lost hours on it (probably -O0 is
so slow that it wasn't stressing the interconnects enough). The default,
--buildtype=debugoptimized, is fine. Also, if testing performance,
first check that the HW has realistic NUMA remote access distances,
e.g. here the remote access distance was 2x or even 3x. Probably
this is only worth testing on multi-socket systems, which have really
higher latencies/throughput limitations, but reports from 1-socket MCM
CPUs (with various Node-per-Socket BIOS settings) are welcome too.

kernel 6.14.7 was used with full isolation:
cpupower frequency-set --governor performance
cpupower idle-set -D0
echo 1 > /sys/devices/system/cpu/intel_pstate/no_turbo
echo never > /sys/kernel/mm/transparent_hugepage/enabled
echo never > /sys/kernel/mm/transparent_hugepage/defrag

max_connections = '10000'
huge_pages = 'on'
wal_level = 'minimal'
wal_buffers = '1024MB'
max_wal_senders = '0'
shared_buffers = '4 GB'
autovacuum = 'off'
max_parallel_workers_per_gather = '0'
numa = 'all'
#numa = 'off'

[1]: https://lwn.net/Articles/897536/

Attachments:

v4-0001-Add-capability-to-interleave-shared-memory-across.patch (application/octet-stream)

From da7fa0b8af9b75108bd4f0b50b25bdf1a2167473 Mon Sep 17 00:00:00 2001
From: Jakub Wartak <jakub.wartak@enterprisedb.com>
Date: Tue, 24 Jun 2025 11:23:36 +0200
Subject: [PATCH v4] Add capability to interleave shared memory across multiple
 NUMA nodes.

Introduce new GUC numa=off(default)/auto/all/../=../@.. that might be used to
enable interleaving of shared memory. Until today, imbalances in shared memory
allocations on NUMA setups, may have caused non-deterministic performance
due to differences in latencies and bandwidths across interconnects ("remote"
access).

When provided list of nodes, the default is to use interleave memory on
preferred NUMA nodes, but support for more strict modes: pinning memory or
pinning both memory and CPU to specific NUMA node(s) is handled using special
'=' and '@' prefixes.

This is only supported on Linux with libnuma.

Author: Jakub Wartak <jakub.wartak@enterprisedb.com>
Inspired-by: Andres Freund <andres@anarazel.de>
Reviewed-by:
Discussion: https://postgr.es/m/CAKZiRmw6i1W1AwXxa-Asrn8wrVcVH3TO715g_MCoowTS9rkGyw%40mail.gmail.com
---
 doc/src/sgml/config.sgml                      |  26 ++++
 src/backend/port/sysv_shmem.c                 |  22 ++++
 src/backend/postmaster/postmaster.c           |  17 +++
 src/backend/storage/ipc/dsm_impl.c            |  13 ++
 src/backend/storage/ipc/shmem.c               |  76 ++++++++++++
 src/backend/utils/misc/guc_tables.c           |  14 +++
 src/backend/utils/misc/postgresql.conf.sample |   2 +
 src/include/port/pg_numa.h                    |  13 ++
 src/include/storage/pg_shmem.h                |  19 +++
 src/include/utils/guc_hooks.h                 |   2 +
 src/port/pg_numa.c                            | 114 +++++++++++++++++-
 11 files changed, 317 insertions(+), 1 deletion(-)

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index b265cc89c9d..0ab5c519624 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -2329,6 +2329,32 @@ include_dir 'conf.d'
       </listitem>
      </varlistentry>
 
+     <varlistentry id="guc-numa" xreflabel="numa">
+      <term><varname>numa</varname> (<type>enum</type>)
+      <indexterm>
+       <primary><varname>numa</varname> configuration parameter</primary>
+      </indexterm>
+      </term>
+      <listitem>
+       <para>
+        Specifies whether to use NUMA interleaving policy for the shared memory
+        segment. Possible values are <literal>off</literal>,
+        <literal>all</literal> (interleaves shared memory across all available NUMA nodes),
+        <literal>auto</literal> (as previous, but only if number of available NUMA nodes is 2 or higher)
+        or <literal>[=@]comma-separated list of node numbers or node ranges</literal>
+
+        If comma-separated list of NUMA nodes is prefixed with <literal>=</literal> the memory allocations
+        are made strict to avoid spilling to other NUMA nodes.
+        If comma-separated list of NUMA nodes is prefixed with <literal>@</literal> the memory allocations
+        are made strict and also available CPUs are limited only to those of listed NUMA nodes.
+
+        This parameter is only effective on Linux. Parallel Query interleaving is
+        only supported with <literal>dynamic_shared_memory</literal>=<literal>posix</literal>
+        The default value is <literal>off</literal>. This parameter can only be
+        set at server start.
+       </para>
+      </listitem>
+     </varlistentry>
      </variablelist>
      </sect2>
 
diff --git a/src/backend/port/sysv_shmem.c b/src/backend/port/sysv_shmem.c
index 197926d44f6..77af7c56ecd 100644
--- a/src/backend/port/sysv_shmem.c
+++ b/src/backend/port/sysv_shmem.c
@@ -29,6 +29,7 @@
 
 #include "miscadmin.h"
 #include "port/pg_bitutils.h"
+#include "port/pg_numa.h"
 #include "portability/mem.h"
 #include "storage/dsm.h"
 #include "storage/fd.h"
@@ -663,6 +664,27 @@ CreateAnonymousSegment(Size *size)
 						 allocsize) : 0));
 	}
 
+	if (numa->setting > NUMA_OFF)
+	{
+		/* In strict mode we want to ensure to not spill memory to another NUMA nodes */
+		int mem_bind_policy = numa->setting >= NUMA_STRICT_ONLY ? 1 : 0;
+
+		/* We do nothing in auto mode, if there is just one standard NUMA node */
+		if(numa->setting == NUMA_AUTO && pg_numa_get_max_node() <= 1) {
+			elog(DEBUG1, "no NUMA nodes found");
+		} else {
+			elog(LOG, "enabling NUMA shm interleaving");
+			pg_numa_interleave_memptr(ptr, allocsize, numa->nodes);
+
+			/* In NUMA_PREFERRED we can spill memory to other nodes, but not in STRICT modes */
+			pg_numa_set_bind_policy(mem_bind_policy);
+
+			/* We can also isolate CPUs to just isolated NUMA nodes */
+			if(numa->setting >= NUMA_STRICT_ONLY_AND_CPU_TOO)
+				pg_numa_bind(numa->nodes);
+		}
+	}
+
 	*size = allocsize;
 	return ptr;
 }
diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index 490f7ce3664..bc9e3da8fa7 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -100,6 +100,7 @@
 #include "pg_getopt.h"
 #include "pgstat.h"
 #include "port/pg_bswap.h"
+#include "port/pg_numa.h"
 #include "postmaster/autovacuum.h"
 #include "postmaster/bgworker_internals.h"
 #include "postmaster/pgarch.h"
@@ -113,6 +114,7 @@
 #include "storage/fd.h"
 #include "storage/io_worker.h"
 #include "storage/ipc.h"
+#include "storage/pg_shmem.h"
 #include "storage/pmsignal.h"
 #include "storage/proc.h"
 #include "tcop/backend_startup.h"
@@ -453,6 +455,7 @@ static void StartSysLogger(void);
 static void StartAutovacuumWorker(void);
 static bool StartBackgroundWorker(RegisteredBgWorker *rw);
 static void InitPostmasterDeathWatchHandle(void);
+static void InitNuma(void);
 
 #ifdef WIN32
 #define WNOHANG 0				/* ignored, so any integer value will do */
@@ -993,6 +996,9 @@ PostmasterMain(int argc, char *argv[])
 		ExitPostmaster(0);
 	}
 
+	/* Initialize libnuma if necessary */
+	InitNuma();
+
 	/*
 	 * Set up shared memory and semaphores.
 	 *
@@ -4616,3 +4622,14 @@ InitPostmasterDeathWatchHandle(void)
 								 GetLastError())));
 #endif							/* WIN32 */
 }
+
+
+static void
+InitNuma(void)
+{
+	if(numa->setting > NUMA_OFF) {
+		if (pg_numa_init() == -1)
+			elog(ERROR, "libnuma initialization failed or NUMA is not supported on this platform");
+	}
+	return;
+}
diff --git a/src/backend/storage/ipc/dsm_impl.c b/src/backend/storage/ipc/dsm_impl.c
index 6bf8ab5bb5b..46dcef48394 100644
--- a/src/backend/storage/ipc/dsm_impl.c
+++ b/src/backend/storage/ipc/dsm_impl.c
@@ -64,8 +64,10 @@
 #include "pgstat.h"
 #include "portability/mem.h"
 #include "postmaster/postmaster.h"
+#include "port/pg_numa.h"
 #include "storage/dsm_impl.h"
 #include "storage/fd.h"
+#include "storage/pg_shmem.h"
 #include "utils/guc.h"
 #include "utils/memutils.h"
 
@@ -334,6 +336,13 @@ dsm_impl_posix(dsm_op op, dsm_handle handle, Size request_size,
 	}
 	*mapped_address = address;
 	*mapped_size = request_size;
+
+	/* We interleave memory only at creation time. */
+	if (op == DSM_OP_CREATE && numa->setting > NUMA_OFF) {
+		elog(DEBUG1, "interleaving shm mem @ %p size=%zu", *mapped_address, *mapped_size);
+		pg_numa_interleave_memptr(*mapped_address, *mapped_size, numa->nodes);
+	}
+
 	close(fd);
 	ReleaseExternalFD();
 
@@ -588,6 +597,8 @@ dsm_impl_sysv(dsm_op op, dsm_handle handle, Size request_size,
 	*mapped_address = address;
 	*mapped_size = request_size;
 
+	/* As dynamic_shared_memory_type=sysv is a bit legacy, we do not perform NUMA interleave here */
+
 	return true;
 }
 #endif
@@ -937,6 +948,8 @@ dsm_impl_mmap(dsm_op op, dsm_handle handle, Size request_size,
 	*mapped_address = address;
 	*mapped_size = request_size;
 
+	/* As dynamic_shared_memory_type=mmap is a bit legacy, we do not perform NUMA interleave here */
+
 	if (CloseTransientFile(fd) != 0)
 	{
 		ereport(elevel,
diff --git a/src/backend/storage/ipc/shmem.c b/src/backend/storage/ipc/shmem.c
index c9ae3b45b76..bac84492e79 100644
--- a/src/backend/storage/ipc/shmem.c
+++ b/src/backend/storage/ipc/shmem.c
@@ -74,6 +74,10 @@
 #include "storage/shmem.h"
 #include "storage/spin.h"
 #include "utils/builtins.h"
+#include "utils/guc.h"
+#include "utils/guc_hooks.h"
+#include <ctype.h>
+#include <numa.h>
 
 static void *ShmemAllocRaw(Size size, Size *allocated_size);
 
@@ -765,3 +769,75 @@ pg_numa_available(PG_FUNCTION_ARGS)
 {
 	PG_RETURN_BOOL(pg_numa_init() != -1);
 }
+
+bool
+check_numa(char **newval, void **extra, GucSource source)
+{
+	bool		result = true;
+	NumaConfigData *n;
+	char	   *rawstring = *newval;
+
+	n = (NumaConfigData *) guc_malloc(LOG, sizeof(NumaConfigData));
+#ifndef USE_LIBNUMA
+	n->setting = NUMA_OFF;
+
+	if (!(strcmp(rawstring, "") == 0 || strcmp(rawstring, "off") == 0)) {
+
+		GUC_check_errdetail("\"%s\" is not supported on this platform.",
+							"numa");
+		result = false;
+	}
+#else
+
+	/* in case of just listing NUMA nodes it's list of preferred ones */
+	n->setting = NUMA_PREFERRED;
+
+	if (strcmp(rawstring, "") == 0)
+		n->setting = DEFAULT_NUMA;
+	else if (pg_strcasecmp(rawstring, "off") == 0)
+		n->setting = NUMA_OFF;
+	else if (pg_strcasecmp(rawstring, "all") == 0) {
+		n->setting = NUMA_ALL;
+		n->nodes = numa_all_nodes_ptr;
+	} else if (pg_strcasecmp(rawstring, "auto") == 0) {
+		n->setting = NUMA_AUTO;
+		n->nodes = numa_all_nodes_ptr;
+	} else if (isdigit(rawstring[0]))
+		n->setting = NUMA_PREFERRED;
+	else if (rawstring[0] == '=')
+		n->setting = NUMA_STRICT_ONLY;
+	else if (rawstring[0] == '@')
+		n->setting = NUMA_STRICT_ONLY_AND_CPU_TOO;
+	else {
+		GUC_check_errdetail("Invalid option \"%s\".", rawstring);
+		guc_free(n);
+		return false;
+	}
+
+	if(n->setting >= NUMA_PREFERRED) {
+		char *s = rawstring;
+
+		/* skip first character */
+		if(n->setting >= NUMA_STRICT_ONLY)
+			s++;
+
+		n->nodes = pg_numa_parse_nodestring(s);
+		if(n->nodes == 0) {
+			GUC_check_errdetail("Invalid list syntax in parameter \"%s\".",
+				"numa");
+			guc_free(n);
+			return false;
+		}
+	}
+
+#endif
+
+	*extra = n;
+	return result;
+}
+
+void
+assign_numa(const char *newval, void *extra)
+{
+	numa = (NumaConfigData *) extra;
+}
diff --git a/src/backend/utils/misc/guc_tables.c b/src/backend/utils/misc/guc_tables.c
index f04bfedb2fd..65b7ab7b5b0 100644
--- a/src/backend/utils/misc/guc_tables.c
+++ b/src/backend/utils/misc/guc_tables.c
@@ -491,6 +491,7 @@ static const struct config_enum_entry file_copy_method_options[] = {
 	{NULL, 0, false}
 };
 
+
 /*
  * Options for enum values stored in other modules
  */
@@ -580,6 +581,8 @@ int			huge_pages = HUGE_PAGES_TRY;
 int			huge_page_size;
 int			huge_pages_status = HUGE_PAGES_UNKNOWN;
 
+NumaConfigData *numa;
+
 /*
  * These variables are all dummies that don't do anything, except in some
  * cases provide the value for SHOW to display.  The real state is elsewhere
@@ -594,6 +597,7 @@ static char *server_version_string;
 static int	server_version_num;
 static char *debug_io_direct_string;
 static char *restrict_nonsystem_relation_kind_string;
+static char *numa_string;
 
 #ifdef HAVE_SYSLOG
 #define	DEFAULT_SYSLOG_FACILITY LOG_LOCAL0
@@ -4984,6 +4988,16 @@ struct config_string ConfigureNamesString[] =
 		check_log_connections, assign_log_connections, NULL
 	},
 
+	{
+		{"numa", PGC_POSTMASTER, RESOURCES_MEM,
+			gettext_noop("Whether to enable NUMA optimizations."),
+			NULL
+		},
+		&numa_string,
+		"",
+		check_numa, assign_numa, NULL
+	},
+
 
 	/* End-of-list marker */
 	{
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index 341f88adc87..d9e0c165a94 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -135,6 +135,8 @@
 					# (change requires restart)
 #huge_page_size = 0			# zero for system default
 					# (change requires restart)
+#numa = off				# off, all, auto, or comma list of NUMA nodes
+					# (change requires restart)
 #temp_buffers = 8MB			# min 800kB
 #max_prepared_transactions = 0		# zero disables the feature
 					# (change requires restart)
diff --git a/src/include/port/pg_numa.h b/src/include/port/pg_numa.h
index 40f1d324dcf..567cef3c505 100644
--- a/src/include/port/pg_numa.h
+++ b/src/include/port/pg_numa.h
@@ -14,9 +14,19 @@
 #ifndef PG_NUMA_H
 #define PG_NUMA_H
 
+// JW: is this legal to be included here?
+#include <numa.h>
+#include <numaif.h>
+
+typedef struct bitmask pg_numa_bitmask_t;
+
 extern PGDLLIMPORT int pg_numa_init(void);
 extern PGDLLIMPORT int pg_numa_query_pages(int pid, unsigned long count, void **pages, int *status);
 extern PGDLLIMPORT int pg_numa_get_max_node(void);
+extern PGDLLIMPORT int pg_numa_interleave_memptr(void *ptr, size_t sz, pg_numa_bitmask_t *mask);
+extern PGDLLIMPORT pg_numa_bitmask_t *pg_numa_parse_nodestring(const char *string);
+extern PGDLLIMPORT void pg_numa_set_bind_policy(int strict);
+extern PGDLLIMPORT void pg_numa_bind(pg_numa_bitmask_t *nodemask);
 
 #ifdef USE_LIBNUMA
 
@@ -27,6 +37,9 @@ extern PGDLLIMPORT int pg_numa_get_max_node(void);
 #define pg_numa_touch_mem_if_required(ro_volatile_var, ptr) \
 	ro_volatile_var = *(volatile uint64 *) ptr
 
+extern void numa_warn(int num, char *fmt,...) pg_attribute_printf(2, 3);
+extern void numa_error(char *where);
+
 #else
 
 #define pg_numa_touch_mem_if_required(ro_volatile_var, ptr) \
diff --git a/src/include/storage/pg_shmem.h b/src/include/storage/pg_shmem.h
index 5f7d4b83a60..0c95fc4cdd0 100644
--- a/src/include/storage/pg_shmem.h
+++ b/src/include/storage/pg_shmem.h
@@ -25,6 +25,7 @@
 #define PG_SHMEM_H
 
 #include "storage/dsm_impl.h"
+#include "port/pg_numa.h"
 
 typedef struct PGShmemHeader	/* standard header for all Postgres shmem */
 {
@@ -41,11 +42,17 @@ typedef struct PGShmemHeader	/* standard header for all Postgres shmem */
 #endif
 } PGShmemHeader;
 
+typedef struct NumaConfigData {
+	int				  setting;
+	pg_numa_bitmask_t *nodes;
+} NumaConfigData;
+
 /* GUC variables */
 extern PGDLLIMPORT int shared_memory_type;
 extern PGDLLIMPORT int huge_pages;
 extern PGDLLIMPORT int huge_page_size;
 extern PGDLLIMPORT int huge_pages_status;
+extern PGDLLIMPORT NumaConfigData *numa;
 
 /* Possible values for huge_pages and huge_pages_status */
 typedef enum
@@ -64,6 +71,18 @@ typedef enum
 	SHMEM_TYPE_MMAP,
 }			PGShmemType;
 
+typedef enum
+{
+	NUMA_OFF,
+	NUMA_ALL,
+	NUMA_AUTO,
+	NUMA_PREFERRED,
+	NUMA_STRICT_ONLY,
+	NUMA_STRICT_ONLY_AND_CPU_TOO,
+}			NumaType;
+
+#define DEFAULT_NUMA NUMA_OFF
+
 #ifndef WIN32
 extern PGDLLIMPORT unsigned long UsedShmemSegID;
 #else
diff --git a/src/include/utils/guc_hooks.h b/src/include/utils/guc_hooks.h
index 799fa7ace68..854a7dd02b4 100644
--- a/src/include/utils/guc_hooks.h
+++ b/src/include/utils/guc_hooks.h
@@ -94,6 +94,8 @@ extern bool check_multixact_member_buffers(int *newval, void **extra,
 extern bool check_multixact_offset_buffers(int *newval, void **extra,
 										   GucSource source);
 extern bool check_notify_buffers(int *newval, void **extra, GucSource source);
+extern bool check_numa(char **newval, void **extra, GucSource source);
+extern void assign_numa(const char *newval, void *extra);
 extern bool check_primary_slot_name(char **newval, void **extra,
 									GucSource source);
 extern bool check_random_seed(double *newval, void **extra, GucSource source);
diff --git a/src/port/pg_numa.c b/src/port/pg_numa.c
index 4b487a2a4e8..6956f33ef44 100644
--- a/src/port/pg_numa.c
+++ b/src/port/pg_numa.c
@@ -13,10 +13,17 @@
  *-------------------------------------------------------------------------
  */
 
-#include "c.h"
+//JW:is this legal to replace "c.h" with below:
+#ifndef FRONTEND
+#include "postgres.h"
+#else
+#include "postgres_fe.h"
+#endif
+
 #include <unistd.h>
 
 #include "port/pg_numa.h"
+#include "common/string.h"
 
 /*
  * At this point we provide support only for Linux thanks to libnuma, but in
@@ -55,6 +62,87 @@ pg_numa_get_max_node(void)
 	return numa_max_node();
 }
 
+int
+pg_numa_interleave_memptr(void *ptr, size_t sz, pg_numa_bitmask_t *mask)
+{
+	numa_interleave_memory(ptr, sz, mask);
+	return 0;
+}
+
+pg_numa_bitmask_t *
+pg_numa_parse_nodestring(const char *string)
+{
+	return numa_parse_nodestring(string);
+}
+
+void
+pg_numa_set_bind_policy(int strict)
+{
+	numa_set_bind_policy(strict);
+}
+
+void
+pg_numa_bind(pg_numa_bitmask_t *nodemask)
+{
+	numa_bind(nodemask);
+}
+
+#ifndef FRONTEND
+/*
+ * The standard libnuma built-in code can be seen here:
+ * https://github.com/numactl/numactl/blob/master/libnuma.c
+ *
+ */
+void
+numa_warn(int num, char *fmt,...)
+{
+	va_list		ap;
+	int			olde = errno;
+	int			needed;
+	StringInfoData msg;
+
+	initStringInfo(&msg);
+
+	va_start(ap, fmt);
+	needed = appendStringInfoVA(&msg, fmt, ap);
+	va_end(ap);
+	if (needed > 0)
+	{
+		enlargeStringInfo(&msg, needed);
+		va_start(ap, fmt);
+		appendStringInfoVA(&msg, fmt, ap);
+		va_end(ap);
+	}
+
+	/* chomp last newline character */
+	pg_strip_crlf(msg.data);
+
+	ereport(WARNING,
+			(errcode(ERRCODE_EXTERNAL_ROUTINE_EXCEPTION),
+			 errmsg_internal("libnuma: %s", msg.data)));
+
+	pfree(msg.data);
+
+	errno = olde;
+}
+
+void
+numa_error(char *where)
+{
+	int			olde = errno;
+
+	/* chomp last newline character */
+	pg_strip_crlf(where);
+
+	/*
+	 * XXX: for now we issue just WARNING, but long-term that might depend on
+	 * numa_set_strict() here.
+	 */
+	elog(WARNING, "libnuma: %s", where);
+	errno = olde;
+}
+#endif							/* FRONTEND */
+
 #else
 
 /* Empty wrappers */
@@ -77,4 +165,28 @@ pg_numa_get_max_node(void)
 	return 0;
 }
 
+int
+pg_numa_interleave_memptr(void *ptr, size_t sz, pg_numa_bitmask_t *mask)
+{
+	return 0;
+}
+
+pg_numa_bitmask_t *
+pg_numa_parse_nodestring(const char *string)
+{
+	return NULL;
+}
+
+void
+pg_numa_set_bind_policy(int strict)
+{
+	return;
+}
+
+void
+pg_numa_bind(pg_numa_bitmask_t *nodemask)
+{
+	return;
+}
+
 #endif
-- 
2.39.5

Jakub Wartak

jakub.wartak@enterprisedb.com

7 months ago

In reply to: Bertrand Drouvot (#6)

1 attachment(s)

Re: NUMA shared memory interleaving

On Fri, Apr 18, 2025 at 7:48 PM Bertrand Drouvot
<bertranddrouvot.pg@gmail.com> wrote:

Hi,

On Thu, Apr 17, 2025 at 01:58:44AM +1200, Thomas Munro wrote:

On Wed, Apr 16, 2025 at 9:14 PM Jakub Wartak
<jakub.wartak@enterprisedb.com> wrote:

2. Should we also interleave DSA/DSM for Parallel Query? (I'm not an
expert on DSA/DSM at all)

I have no answers but I have speculated for years about a very
specific case (without any idea where to begin due to lack of ... I
guess all this sort of stuff): in ExecParallelHashJoinNewBatch(),
workers split up and try to work on different batches on their own to
minimise contention, and when that's not possible (more workers than
batches, or finishing their existing work at different times and going
to help others), they just proceed in round-robin order. A beginner
thought is: if you're going to help someone working on a hash table,
it would surely be best to have the CPUs and all the data on the same
NUMA node. During loading, cache line ping pong would be cheaper, and
during probing, it *might* be easier to tune explicit memory prefetch
timing that way as it would look more like a single node system with a
fixed latency, IDK (I've shared patches for prefetching before that
showed pretty decent speedups, and the lack of that feature is
probably a bigger problem than any of this stuff, who knows...).
Another beginner thought is that the DSA allocator is a source of
contention during loading: the dumbest problem is that the chunks are
just too small, but it might also be interesting to look into per-node
pools. Or something. IDK, just some thoughts...

I'm also thinking that could be beneficial for parallel workers. I think the
ideal scenario would be to have the parallel workers spread across numa nodes and
accessing their "local" memory first (and help with "remote" memory access if
there is still more work to do "remotely").

Hi Bertrand, I've played with CPU pinning of PQ workers (by adjusting
the postmaster's pinning), but I got quite the opposite results - please see
the attachment, especially "lat"ency versus how the CPUs were assigned relative to
NUMA/s_b when it was not interleaved. Not that I intend to spend a lot
of time researching PQ vs NUMA, but I've included interleaving of PQ
shm segments too in the v4 patch in the nearby subthread. The
attached results were produced some time ago with v1 of the patch,
where the PQ shm segment was not interleaved.

If anything, it would be good to hear whether there are any sensible
production-like scenarios/workloads where dynamic_shared_memory_type should
be set to sysv or mmap (instead of the default posix). Asking for Linux
only, I couldn't imagine anything (?)

-J.

Tomas Vondra

tomas@vondra.me

7 months ago

In reply to: Jakub Wartak (#7)

Re: NUMA shared memory interleaving

Hi,

I agree we should improve the behavior on NUMA systems. But I'm not sure
this patch goes far enough, or perhaps the approach seems a bit too
blunt, ignoring some interesting stuff.

AFAICS the patch essentially does the same thing as

numactl --interleave=all

except that it only does that to shared memory, not to process private
memory (as if we called numa_set_localalloc). Which means it has some of
the problems people observe with --interleave=all.

In particular, this practically guarantees that (with 4K memory pages),
each buffer hits multiple NUMA nodes, because the first half will
go to node N, while the second half goes to node (N+1).

That doesn't seem great. It's likely better than a misbalanced system
with everything allocated on a single NUMA node, but I don't see how it
could be better than a "balanced", properly warmed-up system where the
buffers are not split like this.

But OK, admittedly this only happens for 4K memory pages, and serious
systems with a lot of memory are likely to use huge pages, which makes
this less of an issue (only the buffers crossing the page boundaries
might get split).
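
To make the splitting concrete, here is a minimal sketch in C (not part of the patch). It assumes strict round-robin, page-granular interleaving that starts at a page-aligned origin on node 0, the default 8kB BLCKSZ, and a hypothetical `base` offset of the buffer array from that origin. The helper names are made up for illustration.

```c
#include <assert.h>
#include <stddef.h>

#define BLCKSZ 8192             /* PostgreSQL's default block size */

/*
 * Node that round-robin, page-granular interleaving would assign to a byte
 * offset (assumption: offset 0 is page-aligned and lands on node 0).
 */
static int
interleave_node(size_t offset, size_t os_page_size, int nnodes)
{
    return (int) ((offset / os_page_size) % (size_t) nnodes);
}

/*
 * Returns 1 if buffer buf_id covers more than one OS page and therefore
 * more than one node. "base" is the hypothetical offset of the buffer
 * array from the page-aligned origin.
 */
static int
buffer_spans_nodes(size_t base, size_t buf_id, size_t os_page_size, int nnodes)
{
    size_t      start = base + buf_id * BLCKSZ;
    size_t      first_page = start / os_page_size;
    size_t      last_page = (start + BLCKSZ - 1) / os_page_size;

    /* adjacent pages go to different nodes whenever there are 2+ nodes */
    return nnodes > 1 && first_page != last_page;
}
```

With 4kB pages and 4 nodes, `buffer_spans_nodes(0, 0, 4096, 4)` is 1, and the same holds for every buffer. With 2MB huge pages only a buffer that straddles a huge page boundary is affected, for example buffer 255 when the array starts 4096 bytes past the origin.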

My bigger comment however is that the approach focuses on balancing the
nodes (i.e. ensuring each node gets a fair share of shared memory), and
is entirely oblivious to the internal structure of the shared memory.

* It interleaves the shared segment, but it has many pieces - shared
buffers are the largest but not the only one. Does it make sense to
interleave all the other pieces?

* Some of the pieces are tightly related. For example, we talk about
shared buffers as if it was one big array, but it actually is two arrays
- blocks and descriptors. Even if buffers don't get split between nodes
(thanks to huge pages), there's no guarantee the descriptor for the
buffer does not end on a different node.

* In fact, the descriptors are so much smaller than blocks that it's
practically guaranteed all descriptors will end up on a single node.

I could probably come up with a couple more similar items, but I think
you get the idea. I do think making Postgres NUMA-aware will require
figuring out how to distribute (or not distribute) different parts of
the shared memory, and do that explicitly. And do that in a way that
allows us to do other stuff in a NUMA-aware way, e.g. have separate
freelists and clocksweep for each NUMA node, etc.

That's something numa_interleave_memory simply can't do for us, and I
suppose it might also have other downsides on large instances. I mean,
doesn't it have to create a separate mapping for each memory page?
Wouldn't that be a bit inefficient/costly for big instances?

Of course, I'm not saying all this as a random passerby - I've been
working on a similar patch for a while, based on Andres' experimental
NUMA branch. It's far from complete/perfect, more of a PoC quality, but
I hope to share it on the mailing list sometime soon.

FWIW while I think the patch doesn't go far enough, there's one area
where I think it probably goes way too far - configurability. I agree
it's reasonable to allow running on a subset of nodes, e.g. to split the
system between multiple instances etc. But do we need to configure that
from Postgres? Aren't people likely to already use something like
containers or k8s anyway? I think we should just try to inherit this from
the environment, i.e. determine which nodes we're allowed to run on, and
use that. Maybe we'll find we need to be smarter, but I think we can
leave that for later.

regards

--
Tomas Vondra

#10

Jakub Wartak

jakub.wartak@enterprisedb.com

7 months ago

In reply to: Tomas Vondra (#9)

Re: NUMA shared memory interleaving

Hi Tomas!

On Fri, Jun 27, 2025 at 6:41 PM Tomas Vondra <tomas@vondra.me> wrote:

I agree we should improve the behavior on NUMA systems. But I'm not sure
this patch goes far enough, or perhaps the approach seems a bit too
blunt, ignoring some interesting stuff.

AFAICS the patch essentially does the same thing as

numactl --interleave=all

except that it only does that to shared memory, not to process private
memory (as if we called numa_set_localalloc). Which means it has some of
the problems people observe with --interleave=all.

In particular, this practically guarantees that (with 4K memory pages),
each buffer hits multiple NUMA nodes, because the first half will
go to node N, while the second half goes to node (N+1).

That doesn't seem great. It's likely better than a misbalanced system
with everything allocated on a single NUMA node, but I don't see how it
could be better than a "balanced", properly warmed-up system where the
buffers are not split like this.

But OK, admittedly this only happens for 4K memory pages, and serious
systems with a lot of memory are likely to use huge pages, which makes
this less of an issue (only the buffers crossing the page boundaries
might get split).

My bigger comment however is that the approach focuses on balancing the
nodes (i.e. ensuring each node gets a fair share of shared memory), and
is entirely oblivious to the internal structure of the shared memory.

* It interleaves the shared segment, but it has many pieces - shared
buffers are the largest but not the only one. Does it make sense to
interleave all the other pieces?

* Some of the pieces are tightly related. For example, we talk about
shared buffers as if it was one big array, but it actually is two arrays
- blocks and descriptors. Even if buffers don't get split between nodes
(thanks to huge pages), there's no guarantee the descriptor for the
buffer does not end on a different node.

* In fact, the descriptors are so much smaller than blocks that it's
practically guaranteed all descriptors will end up on a single node.

I could probably come up with a couple more similar items, but I think
you get the idea. I do think making Postgres NUMA-aware will require
figuring out how to distribute (or not distribute) different parts of
the shared memory, and do that explicitly. And do that in a way that
allows us to do other stuff in a NUMA-aware way, e.g. have separate
freelists and clocksweep for each NUMA node, etc.

I do understand what you mean, but I'm *NOT* stating here that it
makes PG fully "NUMA-aware". I actually try to avoid doing so with
each sentence. This is only about the imbalance problem specifically.
I think we could build those follow-up optimizations as separate
patches in this or follow-up threads. If we would do it all in one
giant 0001 (without split) the very first question would be to
quantify the impact of each of those optimizations (for which we would
probably need more GUCs?). Here I'm just showing that the very first
baby step - interleaving - helps avoid interconnect saturation in some
cases too.

Anyway, even putting aside the fact that local malloc()s would be
interleaved, adjusting systemd startup scripts to just include
`numactl --interleave=all` sounds like a dirty hack, not like proper
UX.

Also please note that:
* I do not have a lot of time to dedicate to it, yet I have always been
interested in researching this and wondered why we couldn't do it
for such a long time, hence the previous observability work and now
$subject (note it does not claim to be full-blown NUMA awareness,
just some basic NUMA interleaving as a first [well, second?] step).
* I raised this question in the first post: "How to name this GUC
(numa or numa_shm_interleave)?" I still have no idea, but `numa`
simply looks better, and we could just add way more stuff to it over
time (in PG19 or future versions?). Does that sound good?

That's something numa_interleave_memory simply can't do for us, and I
suppose it might also have other downsides on large instances. I mean,
doesn't it have to create a separate mapping for each memory page?
Wouldn't that be a bit inefficient/costly for big instances?

No? Or what kind of mapping do you have in mind? I think our shared
memory on the kernel side is just a single VMA (contiguous memory
region), on which technically we execute mbind() (libnuma is just a
wrapper around it). I have not observed any kind of regressions,
actually quite the opposite. Not sure what you also mean by 'big
instances' (AFAIK 1-2TB shared_buffers might even fail to start).

Of course, I'm not saying all this as a random passerby - I've been
working on a similar patch for a while, based on Andres' experimental
NUMA branch. It's far from complete/perfect, more of a PoC quality, but
I hope to share it on the mailing list sometime soon.

Cool, I didn't know Andres's branch was public until now. I know he
referenced multiple issues in his presentation (and at the hackathon!), but I
wanted to split the problem up and get something in at least partially,
step by step. I think we should
collaborate (not a lot of people are interested in this?) and I can try to
offer my limited help if you attack those more advanced problems. I
think we could get more juice out of this by
over-allocating/spreading/padding certain special regions (e.g. to
better distribute ProcArray, but what about cache hits?).
Or do you want to start from scratch and re-design/re-think all
shm allocations case by case?

FWIW while I think the patch doesn't go far enough, there's one area
where I think it probably goes way too far - configurability. I agree
it's reasonable to allow running on a subset of nodes, e.g. to split the
system between multiple instances etc. But do we need to configure that
from Postgres? Aren't people likely to already use something like
containers or k8s anyway?
I think we should just try to inherit this from
the environment, i.e. determine which nodes we're allowed to run on, and
use that. Maybe we'll find we need to be smarter, but I think we can
leave that for later.

That's what "numa=all" is all about (take whatever is there in the
OS/namespace), but I do not know a better way than relying on, let's say,
numa_get_mems_allowed() being altered somehow by namespaces/cgroups. I
think if one runs on k8s/containers then it's a quite limited/small
deployment that wouldn't benefit from this at all (I struggle to
imagine the point of a k8s pod using 2+ sockets). Quite the contrary: my
experience indicates that the biggest deployments are usually almost
bare metal? And it's way easier to get consistent results there. Anyway, as
you say, let's leave it for later. PG currently often is not CPU-aware
(i.e. it does not even adjust the sizing of certain structs based on CPU
count), so making it NUMA-aware or cgroup/namespace-aware already sounds
like taking 2-3 steps ahead into the future [I think we had
a discussion at least once in LWLock partitionmanager /
FP_LOCK_SLOTS_PER_BACKEND where I proposed sizing certain
structures based on $VCPUs, or am I misremembering this?]

-J.

#11

Jakub Wartak

jakub.wartak@enterprisedb.com

7 months ago

In reply to: Jakub Wartak (#10)

Re: NUMA shared memory interleaving

On Mon, Jun 30, 2025 at 12:55 PM Jakub Wartak
<jakub.wartak@enterprisedb.com> wrote:
[..]

FWIW while I think the patch doesn't go far enough, there's one area
where I think it probably goes way too far - configurability. I agree
it's reasonable to allow running on a subset of nodes, e.g. to split the
system between multiple instances etc. But do we need to configure that
from Postgres? Aren't people likely to already use something like
containers or k8s anyway?
I think we should just try to inherit this from
the environment, i.e. determine which nodes we're allowed to run on, and
use that. Maybe we'll find we need to be smarter, but I think we can
leave that for later.

That's what "numa=all" is all about (take whatever is there in the
OS/namespace)

My error, that should be: that's what "numa=AUTO" is all about (..)
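
For readers keeping track of the accepted values, the following is a standalone sketch of how check_numa() in the posted patch classifies the GUC string. It is a simplification, not the patch itself: it uses POSIX strcasecmp() instead of pg_strcasecmp(), skips node-list parsing entirely, and `NUMA_INVALID` and `classify_numa_setting` are hypothetical names added only for this illustration.

```c
#include <assert.h>
#include <ctype.h>
#include <strings.h>            /* strcasecmp (POSIX) */

typedef enum
{
    NUMA_OFF,
    NUMA_ALL,
    NUMA_AUTO,
    NUMA_PREFERRED,
    NUMA_STRICT_ONLY,
    NUMA_STRICT_ONLY_AND_CPU_TOO,
    NUMA_INVALID = -1           /* not in the patch; added for this sketch */
} NumaMode;

/* Map a "numa" GUC string to a mode, following the order of checks in the patch. */
static NumaMode
classify_numa_setting(const char *s)
{
    if (s[0] == '\0' || strcasecmp(s, "off") == 0)
        return NUMA_OFF;        /* empty string maps to DEFAULT_NUMA, i.e. off */
    if (strcasecmp(s, "all") == 0)
        return NUMA_ALL;
    if (strcasecmp(s, "auto") == 0)
        return NUMA_AUTO;
    if (isdigit((unsigned char) s[0]))
        return NUMA_PREFERRED;  /* plain node list, e.g. "0,1" */
    if (s[0] == '=')
        return NUMA_STRICT_ONLY;
    if (s[0] == '@')
        return NUMA_STRICT_ONLY_AND_CPU_TOO;
    return NUMA_INVALID;
}
```

Note that in the patch as posted both "all" and "auto" assign numa_all_nodes_ptr as the node mask, so the two values differ only in the stored setting.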

-J.

#12

Tomas Vondra

tomas@vondra.me

7 months ago

In reply to: Jakub Wartak (#10)

Re: NUMA shared memory interleaving

On 6/30/25 12:55, Jakub Wartak wrote:

Hi Tomas!

On Fri, Jun 27, 2025 at 6:41 PM Tomas Vondra <tomas@vondra.me> wrote:

I agree we should improve the behavior on NUMA systems. But I'm not sure
this patch goes far enough, or perhaps the approach seems a bit too
blunt, ignoring some interesting stuff.

AFAICS the patch essentially does the same thing as

numactl --interleave=all

except that it only does that to shared memory, not to process private
memory (as if we called numa_set_localalloc). Which means it has some of
the problems people observe with --interleave=all.

In particular, this practically guarantees that (with 4K memory pages),
each buffer hits multiple NUMA nodes, because the first half will
go to node N, while the second half goes to node (N+1).

That doesn't seem great. It's likely better than a misbalanced system
with everything allocated on a single NUMA node, but I don't see how it
could be better than a "balanced", properly warmed-up system where the
buffers are not split like this.

But OK, admittedly this only happens for 4K memory pages, and serious
systems with a lot of memory are likely to use huge pages, which makes
this less of an issue (only the buffers crossing the page boundaries
might get split).

My bigger comment however is that the approach focuses on balancing the
nodes (i.e. ensuring each node gets a fair share of shared memory), and
is entirely oblivious to the internal structure of the shared memory.

* It interleaves the shared segment, but it has many pieces - shared
buffers are the largest but not the only one. Does it make sense to
interleave all the other pieces?

* Some of the pieces are tightly related. For example, we talk about
shared buffers as if it was one big array, but it actually is two arrays
- blocks and descriptors. Even if buffers don't get split between nodes
(thanks to huge pages), there's no guarantee the descriptor for the
buffer does not end on a different node.

* In fact, the descriptors are so much smaller than blocks that it's
practically guaranteed all descriptors will end up on a single node.

I could probably come up with a couple more similar items, but I think
you get the idea. I do think making Postgres NUMA-aware will require
figuring out how to distribute (or not distribute) different parts of
the shared memory, and do that explicitly. And do that in a way that
allows us to do other stuff in a NUMA-aware way, e.g. have separate
freelists and clocksweep for each NUMA node, etc.

I do understand what you mean, but I'm *NOT* stating here that it
makes PG fully "NUMA-aware". I actually try to avoid doing so with
each sentence. This is only about the imbalance problem specifically.
I think we could build those follow-up optimizations as separate
patches in this or follow-up threads. If we would do it all in one
giant 0001 (without split) the very first question would be to
quantify the impact of each of those optimizations (for which we would
probably need more GUCs?). Here I'm just showing that the very first
baby step - interleaving - helps avoid interconnect saturation in some
cases too.

Anyway, even putting aside the fact that local malloc()s would be
interleaved, adjusting systemd startup scripts to just include
`numactl --interleave=all` sounds like a dirty hack, not like proper
UX.

I wasn't suggesting to do "numactl --interleave=all". My argument was
simply that doing numa_interleave_memory() has most of the same issues,
because it's oblivious to what's stored in the shared memory. Sure, the
fact that local memory is not interleaved too is an improvement.

But I just don't see how this could be 0001, followed by some later
improvements. ISTM the improvements would have to largely undo 0001
first, and it would be nontrivial if an optimization needs to do that
only for some part of the shared memory.

Also please note that:
* I do not have a lot of time to dedicate to it, yet I have always been
interested in researching this and wondered why we couldn't do it
for such a long time, hence the previous observability work and now
$subject (note it does not claim to be full-blown NUMA awareness,
just some basic NUMA interleaving as a first [well, second?] step).

Sorry, I appreciate the time you spent working on these features. It
wasn't my intention to dunk on your patch. I'm afraid this is an example
of how reactions on -hackers are often focused on pointing out issues. I
apologize for that, I should have realized it earlier.

I certainly agree it'd be good to improve the NUMA support, otherwise I
wouldn't be messing with Andres' PoC patches myself.

* I raised this question in the first post: "How to name this GUC
(numa or numa_shm_interleave)?" I still have no idea, but `numa`
simply looks better, and we could just add way more stuff to it over
time (in PG19 or future versions?). Does that sound good?

I'm not sure. In my WIP patch I have a bunch of numa_ GUCs, for
different parts of the shared memory. But that's mostly for development,
to allow easy experimentation. I don't have a clear idea what UX should
look like.

That's something numa_interleave_memory simply can't do for us, and I
suppose it might also have other downsides on large instances. I mean,
doesn't it have to create a separate mapping for each memory page?
Wouldn't that be a bit inefficient/costly for big instances?

No? Or what kind of mapping do you have in mind? I think our shared
memory on the kernel side is just a single VMA (contiguous memory
region), on which technically we execute mbind() (libnuma is just a
wrapper around it). I have not observed any kind of regressions,
actually quite the opposite. Not sure what you also mean by 'big
instances' (AFAIK 1-2TB shared_buffers might even fail to start).

Something as simple as giving a contiguous chunk to each NUMA node.
Essentially 1/nodes goes to the first NUMA node, and so on. I haven't
looked into the details of how NUMA interleaving works, but from the
discussions I had about it, I understood it might be expensive. Not
sure, maybe that's wrong.

But the other reason for a simpler mapping is that it seems useful to be
able to easily calculate which NUMA node a buffer belongs to. Because
then you can do NUMA-aware freelists, clocksweep, etc.
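
A minimal sketch of that idea, under the assumption that the buffers are cut into equal contiguous chunks, one per node (this is not code from either patch, and `buffer_to_node` and its parameters are hypothetical names):

```c
#include <assert.h>

/*
 * Node owning buffer buf_id when nbuffers are split into nnodes contiguous
 * chunks of (rounded-up) equal size. Assumes 0 <= buf_id < nbuffers and
 * nnodes >= 1.
 */
static int
buffer_to_node(int buf_id, int nbuffers, int nnodes)
{
    int         per_node = (nbuffers + nnodes - 1) / nnodes;    /* ceiling */

    return buf_id / per_node;
}
```

With a 4GB buffer pool of 8kB blocks (524288 buffers) on 4 nodes, buffers 0..131071 map to node 0, 131072..262143 to node 1, and so on. A per-node freelist or clocksweep could then pick its partition with a single integer division, whereas page-interleaved memory needs the page size and the array's alignment to answer the same question.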

Of course, I'm not saying all this as a random passerby - I've been
working on a similar patch for a while, based on Andres' experimental
NUMA branch. It's far from complete/perfect, more of a PoC quality, but
I hope to share it on the mailing list sometime soon.

Cool, I didn't know Andres's branch was public until now. I know he
referenced multiple issues in his presentation (and at the hackathon!), but I
wanted to split the problem up and get something in at least partially,
step by step. I think we should
collaborate (not a lot of people are interested in this?) and I can try to
offer my limited help if you attack those more advanced problems. I
think we could get more juice out of this by
over-allocating/spreading/padding certain special regions (e.g. to
better distribute ProcArray, but what about cache hits?).
Or do you want to start from scratch and re-design/re-think all
shm allocations case by case?

+1 to collaboration, absolutely. I was actually planning to ping you
once I have something workable. I hope I'll be able to polish the WIP
patches a little bit and post them sometime this week.

FWIW while I think the patch doesn't go far enough, there's one area
where I think it probably goes way too far - configurability. I agree
it's reasonable to allow running on a subset of nodes, e.g. to split the
system between multiple instances etc. But do we need to configure that
from Postgres? Aren't people likely to already use something like
containers or k8s anyway? I think we should just try to inherit this
from the environment, i.e. determine which nodes we're allowed to run
on, and use that. Maybe we'll find we need to be smarter, but I think we
can leave that for later.

That's what "numa=all" is all about (take whatever is there in the
OS/namespace), but I do not know a better way than relying on, let's
say, numa_get_mems_allowed() being altered somehow by
namespaces/cgroups. I think if one runs on k8s/containers then it's a
quite limited/small deployment that wouldn't benefit from this at all
(I struggle to imagine the point of a k8s pod using 2+ sockets). Quite
the contrary: my experience indicates that the biggest deployments are
usually almost bare metal? And it's way easier to get consistent
results there. Anyway, as you say, let's leave it for later. PG
currently is often not CPU-aware (i.e. it does not even adjust the
sizing of certain structs based on CPU count), so making it NUMA-aware
or cgroup/namespace-aware already sounds like taking 2-3 steps ahead
into the future [I think we had a discussion at least once, in LWLock
partitionmanager / FP_LOCK_SLOTS_PER_BACKEND, where I proposed to size
certain structures based on $VCPUs, or I am misremembering this].

+1 to leave this for later, we can worry about this once we have it
working with the basic whole-system NUMA setups. I hope people doing
some of this will give us feedback on what config they actually need.

regards

--
Tomas Vondra

#13

Jakub Wartak

jakub.wartak@enterprisedb.com

6 months ago

In reply to: Tomas Vondra (#12)

Re: NUMA shared memory interleaving

On Mon, Jun 30, 2025 at 9:23 PM Tomas Vondra <tomas@vondra.me> wrote:

I wasn't suggesting to do "numactl --interleave=all". My argument was
simply that doing numa_interleave_memory() has most of the same issues,
because it's oblivious to what's stored in the shared memory. Sure, the
fact that local memory is not interleaved too is an improvement.

... and that's enough for me to start this ;)

But I just don't see how this could be 0001, followed by some later
improvements. ISTM the improvements would have to largely undo 0001
first, and it would be nontrivial if an optimization needs to do that
only for some part of the shared memory.

OK, maybe I'll back off a bit to see your ideas first. It seems you
are thinking about having multiple separate shared memory segments.

I certainly agree it'd be good to improve the NUMA support, otherwise I
wouldn't be messing with Andres' PoC patches myself.

Yup, cool, let's stick to that.

* I've raised this question in the first post "How to name this GUC
(numa or numa_shm_interleave) ?" I still have no idea, but `numa`,
simply looks better, and we could just add way more stuff to it over
time (in PG19 or future versions?). Does that sound good?

I'm not sure. In my WIP patch I have a bunch of numa_ GUCs, for
different parts of the shared memory. But that's mostly for development,
to allow easy experimentation.

[..]

I don't have a clear idea what UX should look like.

Later (after research/experiments), I could still imagine sticking to
one big `numa` switch like it is today in v4-0001, but maybe with 1-2
additional `really_advanced_numa=stuff` settings (but not lots of
them; I would imagine e.g. that NUMA for analytics could be a different
setup than NUMA for OLTP -- AKA do we want to optimize for
interconnect bandwidth or latency?).

That's something numa_interleave_memory simply can't do for us, and I
suppose it might also have other downsides on large instances. I mean,
doesn't it have to create a separate mapping for each memory page?
Wouldn't that be a bit inefficient/costly for big instances?

No? Or what kind of mapping do you have in mind? I think our shared
memory on the kernel side is just a single VMA (contiguous memory
region), on which technically we execute mbind() (libnuma is just a
wrapper around it). I have not observed any kind of regressions,
actually quite the opposite. Not sure what you also mean by 'big
instances' (AFAIK 1-2TB shared_buffers might even fail to start).

Something as simple as giving a contiguous chunk to each NUMA node.

That would actually be multiple separate VMAs/shared memory regions
(the main one, plus per-structure ones in the NUMA case) and
potentially - speculating here - a slower fork()?

Related: the only complaint I have so far about memory allocated via
mmap(MAP_SHARED|MAP_HUGETLB) with NUMA is that if the per-zone HP free
memory is too small (especially with HP=on), it starts to spill over
to the other nodes without interleaving and without notification. You
may have the same problem here unless it is a strict allocation.
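
One way to spot the spill-over risk before starting the server is to compare the per-node free huge page counters that Linux exposes in sysfs. The sketch below is a rough Python illustration; the function names are hypothetical, and it assumes 2MB huge pages and the usual /sys/devices/system/node layout.

```python
import re
from pathlib import Path


def free_hugepages_per_node(sysfs_root="/sys/devices/system/node", size_kb=2048):
    """Return {node_id: free huge pages} read from the per-node sysfs counters."""
    result = {}
    for node_dir in Path(sysfs_root).glob("node[0-9]*"):
        match = re.fullmatch(r"node(\d+)", node_dir.name)
        counter = node_dir / "hugepages" / f"hugepages-{size_kb}kB" / "free_hugepages"
        if match and counter.is_file():
            result[int(match.group(1))] = int(counter.read_text())
    return result


def nodes_short_of(free, pages_needed_per_node):
    """Nodes whose free huge pages cannot cover an even share of the segment."""
    return sorted(n for n, pages in free.items() if pages < pages_needed_per_node)
```

If nodes_short_of() returns a non-empty list for the expected per-node share, an interleaved or per-node allocation is likely to end up unevenly placed.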

Essentially 1/nodes goes to the first NUMA node, and so on. I haven't
looked into the details of how NUMA interleaving works, but from the
discussions I had about it, I understood it might be expensive. Not
sure, maybe that's wrong.

I would really like to hear the argument for how NUMA interleaving is
expensive on the kernel side. It's literally bandwidth over latency. I
can buy the argument that e.g. having a dedicated mmap(MAP_SHARED)
for ProcArray[] (assuming we are HP=on/2MB) with a smaller page size
just for this struct (sizeof() =~ 832b? so let's assume even wasting
4kB per entry) could better enable the kernel's NUMA autobalancing to
relocate those necessary pages closer to the active processes
(warning: I'm making lots of assumptions, I haven't really checked
memory access patterns for this struct). No idea how bad it would be
for the CPU caches, though, to make it so big in this theoretical
context. But the easy counterargument could also be: smaller page
size = no HP available --> potentially making it __swap-able__ and
potentially causing worse dTLB hit-rates? ... and we are just
discussing a single shared memory entry and there are 73 :)
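
As a back-of-the-envelope check of the numbers above, the following Python sketch computes how many entries share one page and how much memory is lost when each entry is padded to its own 4kB page. The 832-byte entry size is the approximate figure quoted above, not a measured value, and the function names are hypothetical.

```python
ENTRY_SIZE = 832             # approximate per-entry size quoted in the thread
SMALL_PAGE = 4 * 1024        # regular 4kB page
HUGE_PAGE = 2 * 1024 * 1024  # 2MB huge page


def entries_per_page(page_size, entry_size=ENTRY_SIZE):
    """How many whole entries fit in one page (and so share one NUMA node)."""
    return page_size // entry_size


def padded_overhead(n_entries, page_size=SMALL_PAGE, entry_size=ENTRY_SIZE):
    """(bytes used, bytes wasted) when every entry is padded to its own page."""
    used = n_entries * page_size
    wasted = n_entries * (page_size - entry_size)
    return used, wasted
```

Under these assumptions a single 2MB page holds about 2520 entries, so page-level placement cannot separate them by node, while one entry per 4kB page wastes roughly 80% of each page.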

My take is that basic interleaving - as an opt-in - is safe, proven,
and gives *more* predictable latency than without it (of course, as
you mention, we could do better with some allocation for specific
structures, but how do you know where the CPU scheduler puts stuff?).
I think we would need to limit ourselves to just optimizing the most
crucial (hot) stuff, like ProcArray[]; it probably doesn't make sense
to investigate structures like multixacts/subtransactions in this
attempt.

E.g. I could even imagine that we could boost the standby's
NUMA-awareness too, just by putting the most used memory (e.g. XLOG)
on the same node that is used by startup/recovery and walreceiver (by
CPU pinning). I'm not sure it's worth the effort in this attempt,
though, and the problem would be: what to do with those
low-level/optimized allocations after pg_promote() to primary? In
theory this quickly escalates to calling interleave on that struct
again, so maybe let's put it aside.

To sum up, my problem is that optimization possibilities are quite
endless, so we need to settle on something realistic, right?

But the other reason for a simpler mapping is that it seems useful to be
able to easily calculate which NUMA node a buffer belongs to. Because
then you can do NUMA-aware freelists, clocksweep, etc.

Yay, sounds pretty advanced!

+1 to collaboration, absolutely. I was actually planning to ping you
once I have something workable. I hope I'll be able to polish the WIP
patches a little bit and post them sometime this week.

Cool.

-J.

#14

Tomas Vondra

tomas@vondra.me

6 months ago

In reply to: Jakub Wartak (#13)

Re: NUMA shared memory interleaving

On 7/1/25 11:04, Jakub Wartak wrote:

On Mon, Jun 30, 2025 at 9:23 PM Tomas Vondra <tomas@vondra.me> wrote:

I wasn't suggesting to do "numactl --interleave=all". My argument was
simply that doing numa_interleave_memory() has most of the same issues,
because it's oblivious to what's stored in the shared memory. Sure, the
fact that local memory is not interleaved too is an improvement.

... and that's enough for me to start this ;)

But I just don't see how this could be 0001, followed by some later
improvements. ISTM the improvements would have to largely undo 0001
first, and it would be nontrivial if an optimization needs to do that
only for some part of the shared memory.

OK, maybe I'll back off a bit to see your ideas first. It seems you
are thinking about having multiple separate shared memory segments.

I certainly agree it'd be good to improve the NUMA support, otherwise I
wouldn't be messing with Andres' PoC patches myself.

Yup, cool, let's stick to that.

* I've raised this question in the first post "How to name this GUC
(numa or numa_shm_interleave) ?" I still have no idea, but `numa`,
simply looks better, and we could just add way more stuff to it over
time (in PG19 or future versions?). Does that sound good?

I'm not sure. In my WIP patch I have a bunch of numa_ GUCs, for
different parts of the shared memory. But that's mostly for development,
to allow easy experimentation.

[..]

I don't have a clear idea what UX should look like.

Later (after research/experiments), I could still imagine sticking to
one big `numa` switch like it is today in v4-0001, but maybe with 1-2
additional `really_advanced_numa=stuff` settings (but not lots of
them; I would imagine e.g. that NUMA for analytics could be a different
setup than NUMA for OLTP -- AKA do we want to optimize for
interconnect bandwidth or latency?).

Maybe. I have no clear idea yet, but I'd like to keep the number of new
GUCs as low as possible.

That's something numa_interleave_memory simply can't do for us, and I
suppose it might also have other downsides on large instances. I mean,
doesn't it have to create a separate mapping for each memory page?
Wouldn't that be a bit inefficient/costly for big instances?

No? Or what kind of mapping do you have in mind? I think our shared
memory on the kernel side is just a single VMA (contiguous memory
region), on which technically we execute mbind() (libnuma is just a
wrapper around it). I have not observed any kind of regressions,
actually quite the opposite. Not sure what you also mean by 'big
instances' (AFAIK 1-2TB shared_buffers might even fail to start).

Something as simple as giving a contiguous chunk to each NUMA node.

That would actually be multiple separate VMAs/shared memory regions
(the main one, plus per-structure ones in the NUMA case) and
potentially - speculating here - a slower fork()?

I may be confused about what you mean by VMA, but it certainly does not
require creating separate shared memory segments. Interleaving also does
not require that. You can move a certain range of memory to a particular
NUMA node, and that's it.
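
To make the "move a certain range to a particular node" idea concrete, here is a small Python sketch that only computes the per-node ranges for one segment, rounded to whole pages. The helper name is hypothetical, and actually binding each range (for example with mbind() or libnuma's numa_tonode_memory()) is left out.

```python
def split_range(total_bytes, nodes, page_size):
    """Split [0, total_bytes) into one page-aligned contiguous chunk per node.

    Returns a list of (node, offset, length); leftover pages go to the
    lowest-numbered nodes, and only the final chunk may end mid-page.
    """
    total_pages = -(-total_bytes // page_size)  # ceiling division
    base, extra = divmod(total_pages, nodes)
    chunks = []
    offset = 0
    for node in range(nodes):
        pages = base + (1 if node < extra else 0)
        length = min(pages * page_size, total_bytes - offset)
        chunks.append((node, offset, length))
        offset += length
    return chunks
```

For the 4297064448-byte "Buffer Blocks" entry from the first message, four nodes and 2MB pages give one chunk of 1075838976 bytes and three of 1073741824 bytes, the same sizes the interleaved run reported, but laid out as four contiguous ranges instead of alternating pages.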

We may end up with separate shared memory segments for different parts
of the shared memory (instead of having a single segment like now), e.g.
to support dynamic shared_buffers resizing. But even with that we'd have
a shared memory segment for each part, split between NUMA nodes.

Well, we'd probably want separate segments, because for some parts it's
not great to have 2MB pages, because it's too coarse. And you can only
use huge pages for the whole segment.

Related: the only complaint I have so far about memory allocated via
mmap(MAP_SHARED|MAP_HUGETLB) with NUMA is that if the per-zone HP free
memory is too small (especially with HP=on), it starts to spill over
to the other nodes without interleaving and without notification. You
may have the same problem here unless it is a strict allocation.

Essentially 1/nodes goes to the first NUMA node, and so on. I haven't
looked into the details of how NUMA interleaving works, but from the
discussions I had about it, I understood it might be expensive. Not
sure, maybe that's wrong.

I would really like to hear the argument for how NUMA interleaving is
expensive on the kernel side. It's literally bandwidth over latency. I
can buy the argument that e.g. having a dedicated mmap(MAP_SHARED)
for ProcArray[] (assuming we are HP=on/2MB) with a smaller page size
just for this struct (sizeof() =~ 832b? so let's assume even wasting
4kB per entry) could better enable the kernel's NUMA autobalancing to
relocate those necessary pages closer to the active processes
(warning: I'm making lots of assumptions, I haven't really checked
memory access patterns for this struct). No idea how bad it would be
for the CPU caches, though, to make it so big in this theoretical
context. But the easy counterargument could also be: smaller page
size = no HP available --> potentially making it __swap-able__ and
potentially causing worse dTLB hit-rates? ... and we are just
discussing a single shared memory entry and there are 73 :)

I admit I don't recall the exact details of why exactly interleaving
would be expensive on the kernel side. I've been told by smarter people
that might be the case, but I don't remember the exact explanation. And
maybe it isn't measurably more expensive ...

I've been focusing on the aspect that it makes certain things more
difficult, or even impossible ...

My take is that basic interleaving - as an opt-in - is safe, proven,
and gives *more* predictable latency than without it (of course, as
you mention, we could do better with some allocation for specific
structures, but how do you know where the CPU scheduler puts stuff?).
I think we would need to limit ourselves to just optimizing the most
crucial (hot) stuff, like ProcArray[]; it probably doesn't make sense
to investigate structures like multixacts/subtransactions in this
attempt.

My argument is that if we allocate the structs "well" then we can do
something smart later, like pick a PGPROC placed on the current NUMA
node (at connection time), and possibly even pin it to that NUMA node so
that it doesn't get migrated.

This means not just the PGPROC itself, but also stuff like fast-path
locking arrays (which are stored separately). And interleaving could
easily place them on a different NUMA node.

E.g. I could even imagine that we could boost the standby's
NUMA-awareness too, just by putting the most used memory (e.g. XLOG)
on the same node that is used by startup/recovery and walreceiver (by
CPU pinning). I'm not sure it's worth the effort in this attempt,
though, and the problem would be: what to do with those
low-level/optimized allocations after pg_promote() to primary? In
theory this quickly escalates to calling interleave on that struct
again, so maybe let's put it aside.

To sum up, my problem is that optimization possibilities are quite
endless, so we need to settle on something realistic, right?

Perhaps. But maybe we should explore the possibilities first, before
just settling on something at the very beginning. To make an
informed decision we need to know what the costs/benefits are, and we
need to understand what the "advanced" solution might look like, so that
we don't pick a design that makes that impossible.

But the other reason for a simpler mapping is that it seems useful to be
able to easily calculate which NUMA node a buffer belongs to. Because
then you can do NUMA-aware freelists, clocksweep, etc.

Yay, sounds pretty advanced!

+1 to collaboration, absolutely. I was actually planning to ping you
once I have something workable. I hope I'll be able to polish the WIP
patches a little bit and post them sometime this week.

Cool.

regards

--
Tomas Vondra

#15

Tomas Vondra

tomas@vondra.me

6 months ago

In reply to: Tomas Vondra (#14)

Re: NUMA shared memory interleaving

Hi Jakub,

FYI I've posted my experimental NUMA patch series here:

/messages/by-id/099b9433-2855-4f1b-b421-d078a5d82017@vondra.me

I've considered posting it to this thread, but it seemed sufficiently
different to start a new thread.

regards

--
Tomas Vondra