failed NUMA pages inquiry status: Operation not permitted

Started by Christoph Berg, 3 months ago, 24 messages
#1 Christoph Berg
myon@debian.org
1 attachment(s)

src/test/regress/expected/numa.out | 13 +++
src/test/regress/expected/numa_1.out | 5 +

numa_1.out is catching this error:

ERROR: libnuma initialization failed or NUMA is not supported on this platform

This is what I'm getting when running PG18 in docker on Debian trixie
(libnuma 2.0.19).

However, on older distributions, the error is different:

postgres=# select * from pg_shmem_allocations_numa;
ERROR: XX000: failed NUMA pages inquiry status: Operation not permitted
LOCATION: pg_get_shmem_allocations_numa, shmem.c:691

This makes the numa regression tests fail in Docker on Debian bookworm
(libnuma 2.0.16) and older and all of the Ubuntu LTS releases.

The attached patch makes it accept these errors, but perhaps it would
be better to detect it in pg_numa_available().

Christoph

Attachments:

0001-numa-Catch-Operation-not-permitted-error.patch (text/x-diff; charset=us-ascii)
From bfa516b8c68203df8dccab168a729dc9823045dd Mon Sep 17 00:00:00 2001
From: Christoph Berg <myon@debian.org>
Date: Thu, 16 Oct 2025 13:24:56 +0200
Subject: [PATCH] numa: Catch "Operation not permitted" error

On older (before 2.0.19) libnuma versions, the error thrown when the
NUMA status cannot be inquired is different.
---
 .../expected/pg_buffercache_numa_2.out        | 21 +++++++++++++++++++
 src/test/regress/expected/numa_2.out          |  9 ++++++++
 2 files changed, 30 insertions(+)
 create mode 100644 contrib/pg_buffercache/expected/pg_buffercache_numa_2.out
 create mode 100644 src/test/regress/expected/numa_2.out

diff --git a/contrib/pg_buffercache/expected/pg_buffercache_numa_2.out b/contrib/pg_buffercache/expected/pg_buffercache_numa_2.out
new file mode 100644
index 00000000000..b970dd2eaf9
--- /dev/null
+++ b/contrib/pg_buffercache/expected/pg_buffercache_numa_2.out
@@ -0,0 +1,21 @@
+SELECT NOT(pg_numa_available()) AS skip_test \gset
+\if :skip_test
+\quit
+\endif
+-- We expect at least one entry for each buffer
+select count(*) >= (select setting::bigint
+                    from pg_settings
+                    where name = 'shared_buffers')
+from pg_buffercache_numa;
+ERROR:  failed NUMA pages inquiry: Operation not permitted
+-- Check that the functions / views can't be accessed by default. To avoid
+-- having to create a dedicated user, use the pg_database_owner pseudo-role.
+SET ROLE pg_database_owner;
+SELECT count(*) > 0 FROM pg_buffercache_numa;
+ERROR:  permission denied for view pg_buffercache_numa
+RESET role;
+-- Check that pg_monitor is allowed to query view / function
+SET ROLE pg_monitor;
+SELECT count(*) > 0 FROM pg_buffercache_numa;
+ERROR:  failed NUMA pages inquiry: Operation not permitted
+RESET role;
diff --git a/src/test/regress/expected/numa_2.out b/src/test/regress/expected/numa_2.out
new file mode 100644
index 00000000000..b4c19f01f59
--- /dev/null
+++ b/src/test/regress/expected/numa_2.out
@@ -0,0 +1,9 @@
+SELECT NOT(pg_numa_available()) AS skip_test \gset
+\if :skip_test
+SELECT COUNT(*) = 0 AS ok FROM pg_shmem_allocations_numa;
+\quit
+\endif
+-- switch to superuser
+\c -
+SELECT COUNT(*) >= 0 AS ok FROM pg_shmem_allocations_numa;
+ERROR:  failed NUMA pages inquiry status: Operation not permitted
-- 
2.51.0

#2 Tomas Vondra
tomas@vondra.me
In reply to: Christoph Berg (#1)
Re: failed NUMA pages inquiry status: Operation not permitted

On 10/16/25 13:38, Christoph Berg wrote:

src/test/regress/expected/numa.out | 13 +++
src/test/regress/expected/numa_1.out | 5 +

numa_1.out is catching this error:

ERROR: libnuma initialization failed or NUMA is not supported on this platform

This is what I'm getting when running PG18 in docker on Debian trixie
(libnuma 2.0.19).

However, on older distributions, the error is different:

postgres=# select * from pg_shmem_allocations_numa;
ERROR: XX000: failed NUMA pages inquiry status: Operation not permitted
LOCATION: pg_get_shmem_allocations_numa, shmem.c:691

This makes the numa regression tests fail in Docker on Debian bookworm
(libnuma 2.0.16) and older and all of the Ubuntu LTS releases.

It's probably more about the kernel version. What kernels are used by
these systems?

The attached patch makes it accept these errors, but perhaps it would
be better to detect it in pg_numa_available().

Not sure how that would work. It seems this is some sort of permission
check in numa_move_pages, which is not what pg_numa_available does. Also,
it may depend on the page queried (e.g. whether it's exclusive or
shared by multiple processes).

thanks

--
Tomas Vondra

#3 Christoph Berg
myon@debian.org
In reply to: Tomas Vondra (#2)
Re: failed NUMA pages inquiry status: Operation not permitted

Re: Tomas Vondra

It's probably more about the kernel version. What kernels are used by
these systems?

It's the very same kernel, just different docker containers on the
same system. I have not yet investigated where the problem is coming
from; different libnuma versions seemed like the best bet.

Same (differing) results on both these systems:
Linux turing 6.16.7+deb14-amd64 #1 SMP PREEMPT_DYNAMIC Debian 6.16.7-1 (2025-09-11) x86_64 GNU/Linux
Linux jenkins 6.1.0-39-amd64 #1 SMP PREEMPT_DYNAMIC Debian 6.1.148-1 (2025-08-26) x86_64 GNU/Linux

Not sure how that would work. It seems this is some sort of permission
check in numa_move_pages, which is not what pg_numa_available does. Also,
it may depend on the page queried (e.g. whether it's exclusive or
shared by multiple processes).

It's probably the lack of some process capability in that environment.
Maybe there is a way to query that, but I don't know much about that
yet.

Christoph

#4 Christoph Berg
myon@debian.org
In reply to: Christoph Berg (#3)
Re: failed NUMA pages inquiry status: Operation not permitted

Re: To Tomas Vondra

It's the very same kernel, just different docker containers on the
same system. I have not yet investigated where the problem is coming
from; different libnuma versions seemed like the best bet.

numactl shows the problem already:

Host system:

$ numactl --show
policy: default
preferred node: current
physcpubind: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19
cpubind: 0
nodebind: 0
membind: 0
preferred:

debian:trixie-slim container:

$ numactl --show
physcpubind: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19
No NUMA support available on this system.

debian:bookworm-slim container:

$ numactl --show
get_mempolicy: Operation not permitted
get_mempolicy: Operation not permitted
get_mempolicy: Operation not permitted
get_mempolicy: Operation not permitted
policy: default
preferred node: current
physcpubind: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19
cpubind: 0
nodebind: 0
membind: 0
preferred:

Running with sudo does not change the result.

So maybe all that's needed is a get_mempolicy() call in
pg_numa_available()?
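
A minimal sketch of what that probe could look like, mirroring
numactl's own use of get_mempolicy() (untested, and the helper name
is made up):

#include <errno.h>
#include <stdbool.h>
#include <numaif.h>		/* get_mempolicy() */

/*
 * Probe whether the mempolicy syscalls are usable: a kernel without
 * NUMA support reports ENOSYS, a seccomp profile that blocks the
 * syscall reports EPERM.
 */
static bool
pg_numa_usable(void)
{
	if (get_mempolicy(NULL, NULL, 0, 0, 0) < 0 &&
		(errno == ENOSYS || errno == EPERM))
		return false;
	return true;
}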

Christoph

#5 Christoph Berg
myon@debian.org
In reply to: Christoph Berg (#4)
Re: failed NUMA pages inquiry status: Operation not permitted

Re: To Tomas Vondra

So maybe all that's needed is a get_mempolicy() call in
pg_numa_available()?

Or perhaps give up on pg_numa_available, and just have two _1.out and
_2.out that just contain the two different error messages, without
trying to catch the problem.

Christoph

#6 Tomas Vondra
tomas@vondra.me
In reply to: Christoph Berg (#3)
Re: failed NUMA pages inquiry status: Operation not permitted

On 10/16/25 16:54, Christoph Berg wrote:

Re: Tomas Vondra

It's probably more about the kernel version. What kernels are used by
these systems?

It's the very same kernel, just different docker containers on the
same system. I have not yet investigated where the problem is coming
from; different libnuma versions seemed like the best bet.

Same (differing) results on both these systems:
Linux turing 6.16.7+deb14-amd64 #1 SMP PREEMPT_DYNAMIC Debian 6.16.7-1 (2025-09-11) x86_64 GNU/Linux
Linux jenkins 6.1.0-39-amd64 #1 SMP PREEMPT_DYNAMIC Debian 6.1.148-1 (2025-08-26) x86_64 GNU/Linux

Hmmm. Those seem like relatively recent kernels.

Not sure how that would work. It seems this is some sort of permission
check in numa_move_pages, which is not what pg_numa_available does. Also,
it may depend on the page queried (e.g. whether it's exclusive or
shared by multiple processes).

It's probably the lack of some process capability in that environment.
Maybe there is a way to query that, but I don't know much about that
yet.

The move_pages() manpage mentions PTRACE_MODE_READ_REALCREDS (man ptrace),
so maybe that's it.

--
Tomas Vondra

#7 Christoph Berg
myon@debian.org
In reply to: Christoph Berg (#4)
Re: failed NUMA pages inquiry status: Operation not permitted

So maybe all that's needed is a get_mempolicy() call in
pg_numa_available()?

numactl 2.0.19 --show does this:

	if (numa_available() < 0) {
		show_physcpubind();
		printf("No NUMA support available on this system.\n");
		exit(1);
	}

int numa_available(void)
{
	if (get_mempolicy(NULL, NULL, 0, 0, 0) < 0 && (errno == ENOSYS || errno == EPERM))
		return -1;
	return 0;
}

pg_numa_available is already calling numa_available.

But numactl 2.0.16 has this:

int numa_available(void)
{
	if (get_mempolicy(NULL, NULL, 0, 0, 0) < 0 && errno == ENOSYS)
		return -1;
	return 0;
}

... which does not catch the EPERM ("Operation not permitted") error I am seeing.

So maybe PG should implement numa_available itself like that. (Or
accept the output difference so the regression tests are passing.)

Christoph

#8 Tomas Vondra
tomas@vondra.me
In reply to: Christoph Berg (#7)
1 attachment(s)
Re: failed NUMA pages inquiry status: Operation not permitted

On 10/16/25 17:19, Christoph Berg wrote:

So maybe all that's needed is a get_mempolicy() call in
pg_numa_available()?

...

So maybe PG should implement numa_available itself like that. (Or
accept the output difference so the regression tests are passing.)

I'm not sure which of those options is better. I'm a bit worried just
accepting the alternative output would hide some failures in the future
(although it's a low risk).

So I'm leaning toward adjusting pg_numa_init() to also check EPERM, per the
attached patch. It still calls numa_available(), so that we don't
silently miss future libnuma changes.

Can you check this makes it work inside the docker container?

regards

--
Tomas Vondra

Attachments:

0001-Handle-EPERM-in-pg_numa_init.patch (text/x-patch; charset=UTF-8)
From b5550ae6f5bac3de14a86a0f7677db755b27aa73 Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tomas@vondra.me>
Date: Tue, 28 Oct 2025 16:00:07 +0100
Subject: [PATCH] Handle EPERM in pg_numa_init

---
 src/port/pg_numa.c | 12 +++++++++++-
 1 file changed, 11 insertions(+), 1 deletion(-)

diff --git a/src/port/pg_numa.c b/src/port/pg_numa.c
index 3368a43a338..540ada3f8ef 100644
--- a/src/port/pg_numa.c
+++ b/src/port/pg_numa.c
@@ -47,7 +47,17 @@
 int
 pg_numa_init(void)
 {
-	int			r = numa_available();
+	int			r;
+
+	/*
+	 * XXX libnuma versions before 2.0.19 don't handle EPERM by disabling
+	 * NUMA, which then leads to unexpected failures later. This affects
+	 * containers that disable get_mempolicy by a seccomp profile.
+	 */
+	if (get_mempolicy(NULL, NULL, 0, 0, 0) < 0 && (errno == EPERM))
+		r = -1;
+	else
+		r = numa_available();
 
 	return r;
 }
-- 
2.51.0

#9 Christoph Berg
myon@debian.org
In reply to: Christoph Berg (#7)
1 attachment(s)
Re: failed NUMA pages inquiry status: Operation not permitted

Re: To Tomas Vondra

So maybe PG should implement numa_available itself like that.

Following our discussion at pgconf.eu last week, I just implemented
that. The numa and pg_buffercache tests pass in Docker on Debian
bookworm now.

Christoph

Attachments:

v2-0001-Make-pg_numa_init-cope-with-Docker.patch (text/x-diff; charset=us-ascii)
From 0b0088145b42a4316fb15a0ea4363bbebfabdfd7 Mon Sep 17 00:00:00 2001
From: Christoph Berg <myon@debian.org>
Date: Thu, 16 Oct 2025 13:24:56 +0200
Subject: [PATCH v2] Make pg_numa_init() cope with Docker

In seccomp-restricted environments like Docker, numactl versions before
2.0.19 would not properly catch EPERM. As the numa_available()
implementation is very short, just inline it here with the proper fix.

Upstream fix: https://github.com/numactl/numactl/commit/0ab9c7a0d857bea1724139c48e2e58ed6a81647f
---
 src/port/pg_numa.c | 15 +++++++++++----
 1 file changed, 11 insertions(+), 4 deletions(-)

diff --git a/src/port/pg_numa.c b/src/port/pg_numa.c
index 3368a43a338..932099be1e5 100644
--- a/src/port/pg_numa.c
+++ b/src/port/pg_numa.c
@@ -43,13 +43,20 @@
 #define NUMA_QUERY_CHUNK_SIZE 1024
 #endif
 
-/* libnuma requires initialization as per numa(3) on Linux */
+/*
+ * libnuma requires initialization as per numa(3) on Linux.
+ *
+ * This should ideally just return numa_available(), but numactl versions
+ * before 2.0.19 ignored EPERM from get_mempolicy(), leading to ugly error
+ * messages when used in seccomp-restricted environments like Docker. We just
+ * inline the 2.0.19 version of numa_available() here.
+ */
 int
 pg_numa_init(void)
 {
-	int			r = numa_available();
-
-	return r;
+	if (get_mempolicy(NULL, NULL, 0, 0, 0) < 0 && (errno == ENOSYS || errno == EPERM))
+		return -1;
+	return 0;
 }
 
 /*
-- 
2.39.5

#10 Christoph Berg
myon@debian.org
In reply to: Tomas Vondra (#8)
Re: failed NUMA pages inquiry status: Operation not permitted

Re: Tomas Vondra

So I'm leaning toward adjusting pg_numa_init() to also check EPERM, per the
attached patch. It still calls numa_available(), so that we don't
silently miss future libnuma changes.

Can you check this makes it work inside the docker container?

Yes, your patch works. (Sorry, I meant to test earlier, but RL...)

Christoph

#11 Tomas Vondra
tomas@vondra.me
In reply to: Christoph Berg (#10)
Re: failed NUMA pages inquiry status: Operation not permitted

On 11/14/25 13:52, Christoph Berg wrote:

Re: Tomas Vondra

So I'm leaning toward adjusting pg_numa_init() to also check EPERM, per the
attached patch. It still calls numa_available(), so that we don't
silently miss future libnuma changes.

Can you check this makes it work inside the docker container?

Yes, your patch works. (Sorry, I meant to test earlier, but RL...)

Thanks. I've pushed the fix (and backpatched to 18).

regards

--
Tomas Vondra

#12 Christoph Berg
myon@debian.org
In reply to: Tomas Vondra (#11)
Re: failed NUMA pages inquiry status: Operation not permitted

Re: Tomas Vondra

So I'm leaning toward adjusting pg_numa_init() to also check EPERM, per the
attached patch. It still calls numa_available(), so that we don't
silently miss future libnuma changes.

Can you check this makes it work inside the docker container?

Yes, your patch works. (Sorry, I meant to test earlier, but RL...)

Thanks. I've pushed the fix (and backpatched to 18).

It looks like we are not done here yet :(

postgresql-18 is failing here intermittently with this diff:

12:20:24 --- /build/reproducible-path/postgresql-18-18.1/src/test/regress/expected/numa.out 2025-11-10 21:52:06.000000000 +0000
12:20:24 +++ /build/reproducible-path/postgresql-18-18.1/build/src/test/regress/results/numa.out 2025-12-11 11:20:22.618989603 +0000
12:20:24 @@ -6,8 +6,4 @@
12:20:24 -- switch to superuser
12:20:24 \c -
12:20:24 SELECT COUNT(*) >= 0 AS ok FROM pg_shmem_allocations_numa;
12:20:24 - ok
12:20:24 -----
12:20:24 - t
12:20:24 -(1 row)
12:20:24 -
12:20:24 +ERROR: invalid NUMA node id outside of allowed range [0, 0]: -2

That's REL_18_STABLE @ 580b5c, with the Debian packaging on top.

I've seen it on unstable/amd64, unstable/arm64, and Ubuntu
questing/amd64, where libnuma should take care of this itself, without
the extra patch in PG. There was another case on bullseye/amd64 which
has the old libnuma.

It's been frequent enough that it killed 4 out of the 10 builds
currently visible on
https://jengus.postgresql.org/job/postgresql-18-binaries-snapshot/.
(Though to be fair, only one distribution/arch combination was failing
for each of them.)

There is also one instance of it in
https://jengus.postgresql.org/job/postgresql-19-binaries-snapshot/

I currently have no idea what's happening.

Christoph

#13 Tomas Vondra
tomas@vondra.me
In reply to: Christoph Berg (#12)
Re: failed NUMA pages inquiry status: Operation not permitted

On 12/11/25 13:29, Christoph Berg wrote:

Re: Tomas Vondra

So I'm leaning toward adjusting pg_numa_init() to also check EPERM, per the
attached patch. It still calls numa_available(), so that we don't
silently miss future libnuma changes.

Can you check this makes it work inside the docker container?

Yes, your patch works. (Sorry, I meant to test earlier, but RL...)

Thanks. I've pushed the fix (and backpatched to 18).

It looks like we are not done here yet :(

postgresql-18 is failing here intermittently with this diff:

12:20:24 --- /build/reproducible-path/postgresql-18-18.1/src/test/regress/expected/numa.out 2025-11-10 21:52:06.000000000 +0000
12:20:24 +++ /build/reproducible-path/postgresql-18-18.1/build/src/test/regress/results/numa.out 2025-12-11 11:20:22.618989603 +0000
12:20:24 @@ -6,8 +6,4 @@
12:20:24 -- switch to superuser
12:20:24 \c -
12:20:24 SELECT COUNT(*) >= 0 AS ok FROM pg_shmem_allocations_numa;
12:20:24 - ok
12:20:24 -----
12:20:24 - t
12:20:24 -(1 row)
12:20:24 -
12:20:24 +ERROR: invalid NUMA node id outside of allowed range [0, 0]: -2

That's REL_18_STABLE @ 580b5c, with the Debian packaging on top.

I've seen it on unstable/amd64, unstable/arm64, and Ubuntu
questing/amd64, where libnuma should take care of this itself, without
the extra patch in PG. There was another case on bullseye/amd64 which
has the old libnuma.

It's been frequent enough that it killed 4 out of the 10 builds
currently visible on
https://jengus.postgresql.org/job/postgresql-18-binaries-snapshot/.
(Though to be fair, only one distribution/arch combination was failing
for each of them.)

There is also one instance of it in
https://jengus.postgresql.org/job/postgresql-19-binaries-snapshot/

I currently have no idea what's happening.

Hmmm, strange. -2 is ENOENT, which should mean this:

-ENOENT
The page is not present.

But what does "not present" mean in this context? And why would that be
only intermittent? Presumably this is still running in Docker, so maybe
it's another weird consequence of that?
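
For illustration, this is roughly how the per-page status is reported
(a standalone sketch, not PostgreSQL code; with a NULL nodes array,
move_pages() only queries page locations; build with -lnuma):

#include <errno.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <numaif.h>		/* move_pages() */

int
main(void)
{
	long		pagesize = sysconf(_SC_PAGESIZE);
	void	   *buf = aligned_alloc(pagesize, pagesize);
	void	   *pages[1] = {buf};
	int			status[1];

	memset(buf, 0, pagesize);	/* touch the page so it is resident */

	/* pid 0 = calling process; status[0] becomes a node id (>= 0)
	 * or a negative errno, e.g. -ENOENT when the page is not present */
	if (move_pages(0, 1, pages, NULL, status, 0) < 0)
		perror("move_pages");
	else if (status[0] < 0)
		printf("page not present: %s\n", strerror(-status[0]));
	else
		printf("page on node %d\n", status[0]);

	free(buf);
	return 0;
}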

regards

--
Tomas Vondra

#14 Christoph Berg
myon@debian.org
In reply to: Tomas Vondra (#13)
Re: failed NUMA pages inquiry status: Operation not permitted

Re: Tomas Vondra

Hmmm, strange. -2 is ENOENT, which should mean this:

-ENOENT
The page is not present.

But what does "not present" mean in this context? And why would that be
only intermittent? Presumably this is still running in Docker, so maybe
it's another weird consequence of that?

Sorry I forgot to mention that this is now in the normal apt.pg.o
build environment (chroots without any funky permission restrictions).
I have not tried Docker yet.

I think it was not happening before the backport of the Docker fix.
But I have no idea why this should have broken anything, and why it
would only happen like 3% of the time.

Christoph

#15 Christoph Berg
myon@debian.org
In reply to: Tomas Vondra (#13)
Re: failed NUMA pages inquiry status: Operation not permitted

Re: Tomas Vondra

Hmmm, strange. -2 is ENOENT, which should mean this:

-ENOENT
The page is not present.

But what does "not present" mean in this context? And why would that be
only intermittent? Presumably this is still running in Docker, so maybe
it's another weird consequence of that?

I've managed to reproduce it once, running this loop on
18-as-of-today. It errored out after a few hundred iterations:

while psql -c 'SELECT COUNT(*) >= 0 AS ok FROM pg_shmem_allocations_numa'; do :; done

2025-12-16 11:49:35.982 UTC [621807] myon@postgres ERROR: invalid NUMA node id outside of allowed range [0, 0]: -2
2025-12-16 11:49:35.982 UTC [621807] myon@postgres STATEMENT: SELECT COUNT(*) >= 0 AS ok FROM pg_shmem_allocations_numa

That was on the apt.pg.o amd64 build machine while a few things were
just building. Maybe ENOENT "The page is not present" means something
was just swapped out because the machine was under heavy load.

I tried reading the kernel source and it sounds related:

* If the source virtual memory range has any unmapped holes, or if
* the destination virtual memory range is not a whole unmapped hole,
* move_pages() will fail respectively with -ENOENT or -EEXIST. This
* provides a very strict behavior to avoid any chance of memory
* corruption going unnoticed if there are userland race conditions.
* Only one thread should resolve the userland page fault at any given
* time for any given faulting address. This means that if two threads
* try to both call move_pages() on the same destination address at the
* same time, the second thread will get an explicit error from this
* command.
...
* The UFFDIO_MOVE_MODE_ALLOW_SRC_HOLES flag can be specified to
* prevent -ENOENT errors to materialize if there are holes in the
* source virtual range that is being remapped. The holes will be
* accounted as successfully remapped in the retval of the
* command. This is mostly useful to remap hugepage naturally aligned
* virtual regions without knowing if there are transparent hugepage
* in the regions or not, but preventing the risk of having to split
* the hugepmd during the remap.
...
ssize_t move_pages(struct userfaultfd_ctx *ctx, unsigned long dst_start,
		   unsigned long src_start, unsigned long len, __u64 mode)
...
	if (!(mode & UFFDIO_MOVE_MODE_ALLOW_SRC_HOLES)) {
		err = -ENOENT;
		break;

What I don't understand yet is why this move_pages() signature does
not match the one from libnuma and move_pages(2) (note "mode" vs "flags"):

int numa_move_pages(int pid, unsigned long count,
	void **pages, const int *nodes, int *status, int flags)
{
	return move_pages(pid, count, pages, nodes, status, flags);
}

I guess the answer is somewhere in that gap.

ERROR: invalid NUMA node id outside of allowed range [0, 0]: -2

Maybe instead of putting sanity checks on what the kernel is
returning, we should just pass that through to the user? (Or perhaps
transform negative numbers to NULL?)
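
The NULL variant could look roughly like this in
pg_get_shmem_allocations_numa() (sketch only; the variable names are
illustrative, not the actual ones in shmem.c):

	if (pages_status[i] < 0)
		nulls[1] = true;	/* page not resident, node unknown */
	else
	{
		values[1] = Int32GetDatum(pages_status[i]);
		nulls[1] = false;
	}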

Christoph

#16 Christoph Berg
myon@debian.org
In reply to: Christoph Berg (#15)
Re: failed NUMA pages inquiry status: Operation not permitted

Re: To Tomas Vondra

I've managed to reproduce it once, running this loop on
18-as-of-today. It errored out after a few hundred iterations:

while psql -c 'SELECT COUNT(*) >= 0 AS ok FROM pg_shmem_allocations_numa'; do :; done

2025-12-16 11:49:35.982 UTC [621807] myon@postgres ERROR: invalid NUMA node id outside of allowed range [0, 0]: -2
2025-12-16 11:49:35.982 UTC [621807] myon@postgres STATEMENT: SELECT COUNT(*) >= 0 AS ok FROM pg_shmem_allocations_numa

That was on the apt.pg.o amd64 build machine while a few things were
just building. Maybe ENOENT "The page is not present" means something
was just swapped out because the machine was under heavy load.

I played a bit more with it.

* It seems to trigger only once for a running cluster. The next one
needs a restart
* If it doesn't trigger within the first 30s, it probably never will
* It seems easier to trigger on a system that is under load (I started
a few pgmodeler compile runs in parallel (C++))

But none of that answers the "why".

Christoph

#17 Tomas Vondra
tomas@vondra.me
In reply to: Christoph Berg (#16)
Re: failed NUMA pages inquiry status: Operation not permitted

On 12/16/25 15:48, Christoph Berg wrote:

Re: To Tomas Vondra

I've managed to reproduce it once, running this loop on
18-as-of-today. It errored out after a few hundred iterations:

while psql -c 'SELECT COUNT(*) >= 0 AS ok FROM pg_shmem_allocations_numa'; do :; done

2025-12-16 11:49:35.982 UTC [621807] myon@postgres ERROR: invalid NUMA node id outside of allowed range [0, 0]: -2
2025-12-16 11:49:35.982 UTC [621807] myon@postgres STATEMENT: SELECT COUNT(*) >= 0 AS ok FROM pg_shmem_allocations_numa

That was on the apt.pg.o amd64 build machine while a few things were
just building. Maybe ENOENT "The page is not present" means something
was just swapped out because the machine was under heavy load.

I played a bit more with it.

* It seems to trigger only once for a running cluster. The next one
needs a restart
* If it doesn't trigger within the first 30s, it probably never will
* It seems easier to trigger on a system that is under load (I started
a few pgmodeler compile runs in parallel (C++))

But none of that answers the "why".

Hmmm, so this is interesting. I tried this on my workstation (with a
single NUMA node), and I see this:

1) right after opening a connection, I get this

test=# select numa_node, count(*) from pg_buffercache_numa group by 1;
numa_node | count
-----------+-------
0 | 290
-2 | 32478
(2 rows)

2) but a select from pg_shmem_allocations_numa works fine

test=# select numa_node, count(*) from pg_shmem_allocations_numa group by 1;
numa_node | count
-----------+-------
0 | 72
(1 row)

3) and if I repeat the pg_buffercache_numa query, it now works

test=# select numa_node, count(*) from pg_buffercache_numa group by 1;
numa_node | count
-----------+-------
0 | 32768
(1 row)

That's a bit strange. I have no idea why this is happening. If I
reconnect, I start getting the failures again.

regards

--
Tomas Vondra

#18 Christoph Berg
myon@debian.org
In reply to: Tomas Vondra (#17)
Re: failed NUMA pages inquiry status: Operation not permitted

Re: Tomas Vondra

1) right after opening a connection, I get this

test=# select numa_node, count(*) from pg_buffercache_numa group by 1;
numa_node | count
-----------+-------
0 | 290
-2 | 32478

Does that mean that the "touch all pages" logic is missing in some
code paths?

But even with that, it seems it can degenerate again, so accepting
-2 in the regression tests would be required to make them stable.

Christoph

#19 Tomas Vondra
tomas@vondra.me
In reply to: Christoph Berg (#18)
Re: failed NUMA pages inquiry status: Operation not permitted

On 12/16/25 18:54, Christoph Berg wrote:

Re: Tomas Vondra

1) right after opening a connection, I get this

test=# select numa_node, count(*) from pg_buffercache_numa group by 1;
numa_node | count
-----------+-------
0 | 290
-2 | 32478

Does that mean that the "touch all pages" logic is missing in some
code paths?

I did check and AFAICS we are touching the pages in pg_buffercache_numa.

To make it even more confusing, I can no longer reproduce the behavior I
reported yesterday. It just consistently reports "0" and I have no idea
why it changed :-( I did restart since yesterday, so maybe that changed
something.

But even with that, it seems it can degenerate again, so accepting
-2 in the regression tests would be required to make them stable.

No opinion yet. Either the -2 can happen occasionally, and then we'd
need to adjust the regression tests. Or maybe it's some thinko, and then
it'd be good to figure out why it's happening.

I find it interesting it does not seem to fail on the buildfarm. Or at
least I'm not aware of such failures. Even a rare failure should show
itself on the buildfarm a couple times, so how come it didn't?

regards

--
Tomas Vondra

#20 Tomas Vondra
tomas@vondra.me
In reply to: Tomas Vondra (#19)
Re: failed NUMA pages inquiry status: Operation not permitted

On 12/17/25 12:07, Tomas Vondra wrote:

On 12/16/25 18:54, Christoph Berg wrote:

Re: Tomas Vondra

1) right after opening a connection, I get this

test=# select numa_node, count(*) from pg_buffercache_numa group by 1;
numa_node | count
-----------+-------
0 | 290
-2 | 32478

Does that mean that the "touch all pages" logic is missing in some
code paths?

I did check and AFAICS we are touching the pages in pg_buffercache_numa.

To make it even more confusing, I can no longer reproduce the behavior I
reported yesterday. It just consistently reports "0" and I have no idea
why it changed :-( I did restart since yesterday, so maybe that changed
something.

I kept poking at this, and I managed to reproduce it again. The key
seems to be that the system needs to be under pressure, and then it's
reliably reproducible (at least for me).

What I did is create two instances: one to keep the system busy, one
for experimentation. The "busy" one is set to use shared_buffers=16GB
and then runs read-only pgbench.

pgbench -i -s 4500 test
pgbench -S -j 16 -c 64 -T 600 -P 1 test

The system has 64GB of RAM and 12 cores, so this is a lot of load.

Then, the other instance is set to use shared_buffers=4GB, is started
and immediately queried for NUMA info for buffers (in a loop):

pg_ctl -D data -l pg.log start;

for r in $(seq 1 10); do
	psql -p 5001 test -c 'select numa_node, count(*) from pg_buffercache_numa group by 1';
done;

pg_ctl -D data -l pg.log stop;

And this often fails like this:

----------------------------------------------------------------------

waiting for server to start.... done
server started
numa_node | count
-----------+---------
0 | 1045302
-2 | 3274
(2 rows)

numa_node | count
-----------+---------
0 | 1048576
(1 row)

numa_node | count
-----------+---------
0 | 1048576
(1 row)

numa_node | count
-----------+---------
0 | 1048576
(1 row)

numa_node | count
-----------+---------
0 | 1048576
(1 row)

numa_node | count
-----------+---------
0 | 1048576
(1 row)

numa_node | count
-----------+---------
0 | 1025321
-2 | 23255
(2 rows)

numa_node | count
-----------+---------
0 | 1038596
-2 | 9980
(2 rows)

numa_node | count
-----------+---------
0 | 1048518
-2 | 58
(2 rows)

numa_node | count
-----------+---------
0 | 1048525
-2 | 51
(2 rows)

waiting for server to shut down.... done
server stopped

----------------------------------------------------------------------

So, it clearly fails quite often. And it can fail even later, after a
run that returned no "-2" buffers.

Clearly, something behaves differently than we thought. I've only seen
this happen on a system with swap - once I removed it, this behavior
disappeared too. So it seems a page can be moved to swap, in which case
we get -2 for a status.

In hindsight, that's not all that surprising. It's interesting it can
happen even with the "touching", but I guess there's a race condition
and the memory can get paged out before we inspect the status. We're
querying batches of pages, which probably makes the window larger.
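
Concretely, the querying happens in chunks (a loop sketch only; the
chunking constant NUMA_QUERY_CHUNK_SIZE is visible in pg_numa.c, but
the variable names here are illustrative):

	for (uint64 off = 0; off < count; off += NUMA_QUERY_CHUNK_SIZE)
	{
		unsigned long chunk = Min(count - off, NUMA_QUERY_CHUNK_SIZE);

		/* the pages were touched earlier, but a page can be paged
		 * out again between the touch and this query call */
		if (numa_move_pages(0, chunk, &pages[off], NULL, &status[off], 0) < 0)
			break;		/* handle error */
	}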

FWIW I now realized I don't even need two instances. If I try this on
the "busy" instance, I get the -2 values too. Which I find a bit weird.
Because why should those be paged out?

The question is what to do about this. I don't think we can prevent the
-2 values, and erroring out does not seem great either (most systems
have swap, so -2 may not be all that rare).

In fact, pg_shmem_allocations_numa probably should not error out either,
because it's now reliably failing (on the busy instance).

I guess the only solution is to accept -2 as a possible value (unknown
node). But that makes regression testing harder, because it means the
output could change a lot ...

regards

--
Tomas Vondra

#21 Christoph Berg
myon@debian.org
In reply to: Tomas Vondra (#20)
Re: failed NUMA pages inquiry status: Operation not permitted

Re: Tomas Vondra

I guess the only solution is to accept -2 as a possible value (unknown
node). But that makes regression testing harder, because it means the
output could change a lot ...

Or just not test that, or do something like

select numa_node = -2 or numa_node between 0 and 1000 from pg_shmem_allocations_numa;

Christoph

#22 Jakub Wartak
jakub.wartak@enterprisedb.com
In reply to: Christoph Berg (#21)
Re: failed NUMA pages inquiry status: Operation not permitted

On Mon, Jan 5, 2026 at 11:30 PM Christoph Berg <myon@debian.org> wrote:

Re: Tomas Vondra

I guess the only solution is to accept -2 as a possible value (unknown
node). But that makes regression testing harder, because it means the
output could change a lot ...

Hi Tomas! That's pretty wild, nice find about that swapping s_b thing!
So just to confirm, that was reproduced outside containers/docker,
right?

Or just not test that, or do something like

select numa_node = -2 or numa_node between 0 and 1000 from pg_shmem_allocations_numa;

Well, with huge pages it should not be swappable, so another idea
would be to simply alter the first line of src/test/regress/sql/numa.sql
and sql/pg_buffercache_numa.sql like below:

- SELECT NOT(pg_numa_available()) AS skip_test \gset
+ SELECT (pg_numa_available() is false OR current_setting('huge_pages_status')::bool is false) as skip_test \gset

(I'm assuming that there are buildfarm animals with huge_pages
enabled; no idea how to check that.)

-J.

#23 Tomas Vondra
tomas@vondra.me
In reply to: Jakub Wartak (#22)
Re: failed NUMA pages inquiry status: Operation not permitted

On 1/6/26 14:23, Jakub Wartak wrote:

On Mon, Jan 5, 2026 at 11:30 PM Christoph Berg <myon@debian.org> wrote:

Re: Tomas Vondra

I guess the only solution is to accept -2 as a possible value (unknown
node). But that makes regression testing harder, because it means the
output could change a lot ...

Hi Tomas! That's pretty wild, nice find about that swapping s_b thing!
So just to confirm, that was reproduced outside containers/docker,
right?

Yes, this is a regular bare-metal Debian system.

Or just not test that, or do something like

select numa_node = -2 or numa_node between 0 and 1000 from pg_shmem_allocations_numa;

Well, with huge pages it should not be swappable, so another idea
would be to simply alter the first line of src/test/regress/sql/numa.sql
and sql/pg_buffercache_numa.sql like below:

- SELECT NOT(pg_numa_available()) AS skip_test \gset
+ SELECT (pg_numa_available() is false OR current_setting('huge_pages_status')::bool is false) as skip_test \gset

(I'm assuming that there are buildfarm animals with huge_pages
enabled; no idea how to check that.)

Yes, using huge pages makes this go away.

I'm also even more sure it's about swap, because /proc/PID/smaps for
postmaster tracks how much of the mapping is in swap, and with regular
memory pages I get values like this for the main shmem segment:

Swap: 90508 kB
Swap: 275272 kB
Swap: 135020 kB
Swap: 116460 kB
Swap: 102388 kB
Swap: 93832 kB
Swap: 155616 kB
Swap: 165692 kB

These are just values from "grep" while the pgbench is running. The
instance has 16GB shared buffers, so 200MB is close to 1%. Not a huge
part, but still ...

I've always "known" shared buffers could be swapped out, but I've never
realized it would affect cases like this one.

I'm not a huge fan of fixing just the tests. Sure, the tests will pass,
but what's the point of that if you then can't run this on production
because it also fails (I mean, the pg_shmem_allocations_numa view will fail)?

I think it's clear we need to tweak this to handle -2 status. And then
also adjust tests to accept non-deterministic results.

regards

--
Tomas Vondra

#24 Jakub Wartak
jakub.wartak@enterprisedb.com
In reply to: Tomas Vondra (#23)
Re: failed NUMA pages inquiry status: Operation not permitted

Hi Tomas,

On Tue, Jan 6, 2026 at 4:36 PM Tomas Vondra <tomas@vondra.me> wrote:
[..]

I've always "known" shared buffers could be swapped out, but I've never realized it would affect cases like this one.

Same, I'm a little surprised by it, but it makes sense. In my old and
more recent tests I've always reasoned the following way: NUMA (2+
sockets) --> probably a big production system --> huge_pages literally
always enabled to avoid a variety of surprises (locks the region).
Also this kind of reminds me of our past discussion about
dividing shm allocations into smaller requests (potentially 4kB shm
regions that are not huge_pages, so in theory swappable) [1].

I'm not a huge fan of fixing just the tests. Sure, the tests will pass,
but what's the point of that if you then can't run this on production
because it also fails (I mean, the pg_shmem_allocations_numa view will fail)?

Well, You are probably right.

I think it's clear we need to tweak this to handle -2 status. And then
also adjust tests to accept non-deterministic results.

The only remaining question is whether we want to expose it to the
user or not. We could:

a) silently ignore ENOENT in the back branches so that "size" won't
contain it (just change pg_get_shmem_allocations_numa()); it is not
part of any NUMA node anyway. Maybe we could also emit a DEBUG1
message, or add a source code comment noting that we think the page
may be swapped out. (A sketch of this follows after option b.)

b) not sure if it is a good idea, but in master we could expose it as
a new column "swapped_out_size" (or change the datatype of the
"numa_node" column from ::integer to something like ::text, allowing
it to hold a node id as an integer but also node="swapped-out" with
the proper size). Sounds like a new minor feature that could tell the
user that shm is swapped out and huge pages really need to be
enabled (?)
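
For option a), the skipping could be as simple as this (an
illustrative sketch; these are not the actual variable names in
pg_get_shmem_allocations_numa()):

	for (uint64 i = 0; i < page_count; i++)
	{
		/* negative status (e.g. -ENOENT): the page is on no node
		 * right now, possibly swapped out; don't count it anywhere */
		if (pages_status[i] < 0)
			continue;

		nodes_size[pages_status[i]] += os_page_size;
	}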

-J.

[1]: /messages/by-id/jqg6jd32sw4s6gjkezauer372xrww7xnupvrcsqkegh2uhv6vg@ppiwoigzz6v4