Re: pgsql: Introduce pg_shmem_allocations_numa view

Started by Christoph Berg · 9 months ago · 53 messages
#1 Christoph Berg
myon@debian.org

Re: Tomas Vondra

Introduce pg_shmem_allocations_numa view

This is acting up on Debian's 32-bit architectures, namely i386, armel
and armhf:

--- /build/reproducible-path/postgresql-18-18~beta1+20250612/src/test/regress/expected/numa.out	2025-06-12 12:21:21.000000000 +0000
+++ /build/reproducible-path/postgresql-18-18~beta1+20250612/build/src/test/regress/results/numa.out	2025-06-12 20:20:33.124292694 +0000
@@ -6,8 +6,4 @@
 -- switch to superuser
 \c -
 SELECT COUNT(*) >= 0 AS ok FROM pg_shmem_allocations_numa;
- ok
-----
- t
-(1 row)
-
+ERROR:  invalid NUMA node id outside of allowed range [0, 0]: -14

The diff is the same on all architectures.

-14 seems to be -EFAULT, and move_pages(2) says:

Page states in the status array
The following values can be returned in each element of the status array.

-EFAULT
This is a zero page or the memory area is not mapped by the process.

https://buildd.debian.org/status/logs.php?pkg=postgresql-18&ver=18%7Ebeta1%2B20250612-1
https://buildd.debian.org/status/fetch.php?pkg=postgresql-18&arch=armel&ver=18%7Ebeta1%2B20250612-1&stamp=1749759646&raw=0

Christoph

#2 Christoph Berg
myon@debian.org
In reply to: Christoph Berg (#1)

Re: To Tomas Vondra

This is acting up on Debian's 32-bit architectures, namely i386, armel
and armhf:

... and x32 (x86_64 instruction set with 32-bit pointers).

SELECT COUNT(*) >= 0 AS ok FROM pg_shmem_allocations_numa;
+ERROR: invalid NUMA node id outside of allowed range [0, 0]: -14

-14 seems to be -EFAULT, and move_pages(2) says:
-EFAULT
This is a zero page or the memory area is not mapped by the process.

I did some debugging on i386 and made it print the page numbers:

 SELECT COUNT(*) >= 0 AS ok FROM pg_shmem_allocations_numa;
+WARNING:  invalid NUMA node id outside of allowed range [0, 0]: -14 for page 35
+WARNING:  invalid NUMA node id outside of allowed range [0, 0]: -14 for page 36
...
+WARNING:  invalid NUMA node id outside of allowed range [0, 0]: -14 for page 32768
+WARNING:  invalid NUMA node id outside of allowed range [0, 0]: -14 for page 32769

So it works for the first few pages and then the rest is EFAULT.

I think the pg_numa_touch_mem_if_required() hack might not be enough
to force the pages to be allocated. Changing that to a memcpy() didn't
help. Is there some optimization that zero pages aren't allocated
until being written to?

Why do we try to force the pages to be allocated at all? This is just
a monitoring function, it should not change the actual system state.
Why not just skip any page where the status is <0 ?

The attached patch removes that logic. Regression tests pass, but we
probably have to think about whether to report these negative numbers
as-is or perhaps convert them to NULL.

Christoph

Attachments:

0001-Don-t-force-allocate-pages-for-pg_get_shmem_allocati.patch (text/x-diff; charset=us-ascii, +2 -61)
#3 Christoph Berg
myon@debian.org
In reply to: Christoph Berg (#2)

Re: To Tomas Vondra

Why do we try to force the pages to be allocated at all? This is just
a monitoring function, it should not change the actual system state.

One-time touching might also not be enough: what if the pages later
get swapped out and the monitoring functions are called again? They
will have to deal with these "not in memory" error conditions anyway.

Christoph

#4 Andres Freund
andres@anarazel.de
In reply to: Christoph Berg (#3)

Hi,

On 2025-06-23 16:48:27 +0200, Christoph Berg wrote:

Re: To Tomas Vondra

Why do we try to force the pages to be allocated at all? This is just
a monitoring function, it should not change the actual system state.

The problem is that the kernel function just gives bogus results for pages
that *are* present in memory but that have only been touched in another
process that has mapped the same range of memory.

One-time touching might also not be enough: what if the pages later
get swapped out and the monitoring functions are called again?

I don't think that's a problem, the process still has a relevant page table
entry in that case.

Greetings,

Andres Freund

#5 Christoph Berg
myon@debian.org
In reply to: Andres Freund (#4)

Re: Andres Freund

Why do we try to force the pages to be allocated at all? This is just
a monitoring function, it should not change the actual system state.

The problem is that the kernel function just gives bogus results for pages
that *are* present in memory but that have only been touched in another
process that has mapped the same range of memory.

Ok, so we leave the touching in, but still defend against negative
status values?

Christoph

#6 Christoph Berg
myon@debian.org
In reply to: Christoph Berg (#5)

Re: To Andres Freund

Ok, so we leave the touching in, but still defend against negative
status values?

v2 attached.

Christoph

Attachments:

v2-0001-Accept-unmapped-NUMA-pages.patch (text/x-diff; charset=us-ascii, +2 -8)
#7 Andres Freund
andres@anarazel.de
In reply to: Christoph Berg (#6)

Hi,

On 2025-06-23 17:59:24 +0200, Christoph Berg wrote:

Re: To Andres Freund

Ok, so we leave the touching in, but still defend against negative
status values?

v2 attached.

How confident are we that this isn't actually because we passed a bogus
address to the kernel or such? With this patch, are *any* pages recognized as
valid on the machines that triggered the error?

I wonder if we ought to report the failures as a separate "numa node"
(e.g. NULL as node id) instead ...

Greetings,

Andres Freund

#8 Christoph Berg
myon@debian.org
In reply to: Andres Freund (#7)

Re: Andres Freund

How confident are we that this isn't actually because we passed a bogus
address to the kernel or such? With this patch, are *any* pages recognized as
valid on the machines that triggered the error?

See upthread - the first 35 pages were ok, then a lot of -14.

I wonder if we ought to report the failures as a separate "numa node"
(e.g. NULL as node id) instead ...

Did that now, using N+1 (== 1 here) for errors in this Debian i386
environment (chroot on an amd64 host):

select * from pg_shmem_allocations_numa \crosstabview

               name               │    0     │    1
──────────────────────────────────┼──────────┼──────────
 multixact_offset                 │    69632 │    65536
 subtransaction                   │   139264 │   131072
 notify                           │   139264 │        0
 Shared Memory Stats              │   188416 │   131072
 serializable                     │   188416 │    86016
 PROCLOCK hash                    │     4096 │        0
 FinishedSerializableTransactions │     4096 │        0
 XLOG Ctl                         │  2117632 │  2097152
 Shared MultiXact State           │     4096 │        0
 Proc Header                      │     4096 │        0
 Archiver Data                    │     4096 │        0
 .... more 0s in the last column ...
 AioHandleData                    │  1429504 │        0
 Buffer Blocks                    │ 67117056 │ 67108864
 Buffer IO Condition Variables    │   266240 │        0
 Proc Array                       │     4096 │        0
 .... more 0s
(73 rows)

There is something fishy with pg_buffercache. If I restart PG, I'm
getting "Bad address" (errno 14), this time as return value of
move_pages().

postgres =# select * from pg_buffercache_numa;
DEBUG: 00000: NUMA: NBuffers=16384 os_page_count=32768 os_page_size=4096
LOCATION: pg_buffercache_numa_pages, pg_buffercache_pages.c:383
2025-06-23 19:41:41.315 UTC [1331894] ERROR: failed NUMA pages inquiry: Bad address
2025-06-23 19:41:41.315 UTC [1331894] STATEMENT: select * from pg_buffercache_numa;
ERROR: XX000: failed NUMA pages inquiry: Bad address
LOCATION: pg_buffercache_numa_pages, pg_buffercache_pages.c:394

Repeated calls are fine.

Maybe NUMA is just not supported on 32-bit archs, but I'd rather be
sure about that before playing that card.

Christoph

#9 Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Christoph Berg (#8)

On 6/23/25 21:57, Christoph Berg wrote:

Re: Andres Freund

How confident are we that this isn't actually because we passed a bogus
address to the kernel or such? With this patch, are *any* pages recognized as
valid on the machines that triggered the error?

See upthread - the first 35 pages were ok, then a lot of -14.

I wonder if we ought to report the failures as a separate "numa node"
(e.g. NULL as node id) instead ...

Did that now, using N+1 (== 1 here) for errors in this Debian i386
environment (chroot on an amd64 host):

select * from pg_shmem_allocations_numa \crosstabview

               name               │    0     │    1
──────────────────────────────────┼──────────┼──────────
 multixact_offset                 │    69632 │    65536
 subtransaction                   │   139264 │   131072
 notify                           │   139264 │        0
 Shared Memory Stats              │   188416 │   131072
 serializable                     │   188416 │    86016
 PROCLOCK hash                    │     4096 │        0
 FinishedSerializableTransactions │     4096 │        0
 XLOG Ctl                         │  2117632 │  2097152
 Shared MultiXact State           │     4096 │        0
 Proc Header                      │     4096 │        0
 Archiver Data                    │     4096 │        0
 .... more 0s in the last column ...
 AioHandleData                    │  1429504 │        0
 Buffer Blocks                    │ 67117056 │ 67108864
 Buffer IO Condition Variables    │   266240 │        0
 Proc Array                       │     4096 │        0
 .... more 0s
(73 rows)

There is something fishy with pg_buffercache. If I restart PG, I'm
getting "Bad address" (errno 14), this time as return value of
move_pages().

postgres =# select * from pg_buffercache_numa;
DEBUG: 00000: NUMA: NBuffers=16384 os_page_count=32768 os_page_size=4096
LOCATION: pg_buffercache_numa_pages, pg_buffercache_pages.c:383
2025-06-23 19:41:41.315 UTC [1331894] ERROR: failed NUMA pages inquiry: Bad address
2025-06-23 19:41:41.315 UTC [1331894] STATEMENT: select * from pg_buffercache_numa;
ERROR: XX000: failed NUMA pages inquiry: Bad address
LOCATION: pg_buffercache_numa_pages, pg_buffercache_pages.c:394

Repeated calls are fine.

Huh. So it's only the first call that does this?

Can you maybe print the addresses passed to pg_numa_query_pages? I
wonder if there's some bug in how we fill that array. Not sure why it
would happen only on 32-bit systems, though.

I'll create a 32-bit VM so that I can try reproducing this.

regards

--
Tomas Vondra

#10 Christoph Berg
myon@debian.org
In reply to: Tomas Vondra (#9)

Re: Tomas Vondra

Huh. So it's only the first call that does this?

The first call after a restart. Reconnecting is not enough.

Can you maybe print the addresses passed to pg_numa_query_pages? I

The addresses look good:

Breakpoint 1, pg_numa_query_pages (pid=0, count=32768, pages=0xeb44d02c, status=0xeb42c02c) at ../src/port/pg_numa.c:49
49 return numa_move_pages(pid, count, pages, NULL, status, 0);
(gdb) p *pages
$1 = (void *) 0xebc33000
(gdb) p pages[1]
$2 = (void *) 0xebc34000
(gdb) p pages[2]
$3 = (void *) 0xebc35000

wonder if there's some bug in how we fill that array. Not sure why it
would happen only on 32-bit systems, though.

I found something, but that should be harmless:

--- a/contrib/pg_buffercache/pg_buffercache_pages.c
+++ b/contrib/pg_buffercache/pg_buffercache_pages.c
@@ -365,7 +365,7 @@ pg_buffercache_numa_pages(PG_FUNCTION_ARGS)
 		/* Used to determine the NUMA node for all OS pages at once */
 		os_page_ptrs = palloc0(sizeof(void *) * os_page_count);
-		os_page_status = palloc(sizeof(uint64) * os_page_count);
+		os_page_status = palloc(sizeof(int) * os_page_count);

/* Fill pointers for all the memory pages. */
idx = 0;

Christoph

#11 Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Christoph Berg (#10)

On 6/23/25 22:31, Christoph Berg wrote:

Re: Tomas Vondra

Huh. So it's only the first call that does this?

The first call after a restart. Reconnecting is not enough.

Can you maybe print the addresses passed to pg_numa_query_pages? I

The addresses look good:

Breakpoint 1, pg_numa_query_pages (pid=0, count=32768, pages=0xeb44d02c, status=0xeb42c02c) at ../src/port/pg_numa.c:49
49 return numa_move_pages(pid, count, pages, NULL, status, 0);
(gdb) p *pages
$1 = (void *) 0xebc33000
(gdb) p pages[1]
$2 = (void *) 0xebc34000
(gdb) p pages[2]
$3 = (void *) 0xebc35000

Didn't you say the first ~35 addresses succeed? What about the
addresses after that?

wonder if there's some bug in how we fill that array. Not sure why it
would happen only on 32-bit systems, though.

I found something, but that should be harmless:

--- a/contrib/pg_buffercache/pg_buffercache_pages.c
+++ b/contrib/pg_buffercache/pg_buffercache_pages.c
@@ -365,7 +365,7 @@ pg_buffercache_numa_pages(PG_FUNCTION_ARGS)
/* Used to determine the NUMA node for all OS pages at once */
os_page_ptrs = palloc0(sizeof(void *) * os_page_count);
-		os_page_status = palloc(sizeof(uint64) * os_page_count);
+		os_page_status = palloc(sizeof(int) * os_page_count);

Yes, good catch. But as you say, that should be benign: we allocate
more memory than needed, so I don't think it should break anything.

--
Tomas Vondra

#12 Christoph Berg
myon@debian.org
In reply to: Tomas Vondra (#11)

Re: Tomas Vondra

Didn't you say the first ~35 addresses succeed? What about the
addresses after that?

That was pg_shmem_allocations_numa. The pg_numa_query_pages() in there
works (does not return -1), but then some of the status[] values are
-14.

When pg_buffercache_numa fails, pg_numa_query_pages() itself
returns -14.

The printed os_page_ptrs[] contents are the same for the failing and
non-failing calls, so the problem is probably elsewhere.

        /* Fill pointers for all the memory pages. */
        idx = 0;
        for (char *ptr = startptr; ptr < endptr; ptr += os_page_size)
        {
+           if (idx < 50)
+               elog(DEBUG1, "os_page_ptrs idx %d = %p", idx, ptr);
            os_page_ptrs[idx++] = ptr;

20:47 myon@postgres =# select * from pg_buffercache_numa;
DEBUG: 00000: os_page_ptrs idx 0 = 0xebc44000
LOCATION: pg_buffercache_numa_pages, pg_buffercache_pages.c:375
DEBUG: 00000: os_page_ptrs idx 1 = 0xebc45000
LOCATION: pg_buffercache_numa_pages, pg_buffercache_pages.c:375
DEBUG: 00000: os_page_ptrs idx 2 = 0xebc46000
LOCATION: pg_buffercache_numa_pages, pg_buffercache_pages.c:375
...
DEBUG: 00000: os_page_ptrs idx 48 = 0xebc74000
LOCATION: pg_buffercache_numa_pages, pg_buffercache_pages.c:375
DEBUG: 00000: os_page_ptrs idx 49 = 0xebc75000
LOCATION: pg_buffercache_numa_pages, pg_buffercache_pages.c:375
DEBUG: 00000: NUMA: NBuffers=16384 os_page_count=32768 os_page_size=4096
LOCATION: pg_buffercache_numa_pages, pg_buffercache_pages.c:385
2025-06-23 20:47:41.827 UTC [1368080] ERROR: failed NUMA pages inquiry: Bad address
2025-06-23 20:47:41.827 UTC [1368080] STATEMENT: select * from pg_buffercache_numa;
ERROR: XX000: failed NUMA pages inquiry: Bad address
LOCATION: pg_buffercache_numa_pages, pg_buffercache_pages.c:396
Time: 92.757 ms

20:47 myon@postgres =# select * from pg_buffercache_numa;
DEBUG: 00000: os_page_ptrs idx 0 = 0xebc44000
LOCATION: pg_buffercache_numa_pages, pg_buffercache_pages.c:375
DEBUG: 00000: os_page_ptrs idx 1 = 0xebc45000
LOCATION: pg_buffercache_numa_pages, pg_buffercache_pages.c:375
DEBUG: 00000: os_page_ptrs idx 2 = 0xebc46000
LOCATION: pg_buffercache_numa_pages, pg_buffercache_pages.c:375
...
DEBUG: 00000: os_page_ptrs idx 48 = 0xebc74000
LOCATION: pg_buffercache_numa_pages, pg_buffercache_pages.c:375
DEBUG: 00000: os_page_ptrs idx 49 = 0xebc75000
LOCATION: pg_buffercache_numa_pages, pg_buffercache_pages.c:375
DEBUG: 00000: NUMA: NBuffers=16384 os_page_count=32768 os_page_size=4096
LOCATION: pg_buffercache_numa_pages, pg_buffercache_pages.c:385
DEBUG: 00000: NUMA: page-faulting the buffercache for proper NUMA readouts
LOCATION: pg_buffercache_numa_pages, pg_buffercache_pages.c:444
Time: 24.547 ms
20:47 myon@postgres =#

Christoph

#13 Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Christoph Berg (#12)

On 6/23/25 22:51, Christoph Berg wrote:

Re: Tomas Vondra

Didn't you say the first ~35 addresses succeed? What about the
addresses after that?

That was pg_shmem_allocations_numa. The pg_numa_query_pages() in there
works (does not return -1), but then some of the status[] values are
-14.

When pg_buffercache_numa fails, pg_numa_query_pages() itself
returns -14.

The printed os_page_ptrs[] contents are the same for the failing and
non-failing calls, so the problem is probably elsewhere.

/* Fill pointers for all the memory pages. */
idx = 0;
for (char *ptr = startptr; ptr < endptr; ptr += os_page_size)
{
+           if (idx < 50)
+               elog(DEBUG1, "os_page_ptrs idx %d = %p", idx, ptr);
os_page_ptrs[idx++] = ptr;

20:47 myon@postgres =# select * from pg_buffercache_numa;
DEBUG: 00000: os_page_ptrs idx 0 = 0xebc44000
LOCATION: pg_buffercache_numa_pages, pg_buffercache_pages.c:375
DEBUG: 00000: os_page_ptrs idx 1 = 0xebc45000
LOCATION: pg_buffercache_numa_pages, pg_buffercache_pages.c:375
DEBUG: 00000: os_page_ptrs idx 2 = 0xebc46000
LOCATION: pg_buffercache_numa_pages, pg_buffercache_pages.c:375
DEBUG: 00000: os_page_ptrs idx 3 = 0xebc47000

...

DEBUG: 00000: os_page_ptrs idx 47 = 0xebc73000
LOCATION: pg_buffercache_numa_pages, pg_buffercache_pages.c:375
DEBUG: 00000: os_page_ptrs idx 48 = 0xebc74000
LOCATION: pg_buffercache_numa_pages, pg_buffercache_pages.c:375
DEBUG: 00000: os_page_ptrs idx 49 = 0xebc75000
LOCATION: pg_buffercache_numa_pages, pg_buffercache_pages.c:375
DEBUG: 00000: NUMA: NBuffers=16384 os_page_count=32768 os_page_size=4096
LOCATION: pg_buffercache_numa_pages, pg_buffercache_pages.c:385
2025-06-23 20:47:41.827 UTC [1368080] ERROR: failed NUMA pages inquiry: Bad address
2025-06-23 20:47:41.827 UTC [1368080] STATEMENT: select * from pg_buffercache_numa;
ERROR: XX000: failed NUMA pages inquiry: Bad address
LOCATION: pg_buffercache_numa_pages, pg_buffercache_pages.c:396
Time: 92.757 ms

20:47 myon@postgres =# select * from pg_buffercache_numa;
DEBUG: 00000: os_page_ptrs idx 0 = 0xebc44000
LOCATION: pg_buffercache_numa_pages, pg_buffercache_pages.c:375
DEBUG: 00000: os_page_ptrs idx 1 = 0xebc45000
LOCATION: pg_buffercache_numa_pages, pg_buffercache_pages.c:375
DEBUG: 00000: os_page_ptrs idx 2 = 0xebc46000
LOCATION: pg_buffercache_numa_pages, pg_buffercache_pages.c:375
DEBUG: 00000: os_page_ptrs idx 3 = 0xebc47000

...

DEBUG: 00000: os_page_ptrs idx 46 = 0xebc72000

LOCATION: pg_buffercache_numa_pages, pg_buffercache_pages.c:375
DEBUG: 00000: os_page_ptrs idx 47 = 0xebc73000
LOCATION: pg_buffercache_numa_pages, pg_buffercache_pages.c:375
DEBUG: 00000: os_page_ptrs idx 48 = 0xebc74000
LOCATION: pg_buffercache_numa_pages, pg_buffercache_pages.c:375
DEBUG: 00000: os_page_ptrs idx 49 = 0xebc75000
LOCATION: pg_buffercache_numa_pages, pg_buffercache_pages.c:375
DEBUG: 00000: NUMA: NBuffers=16384 os_page_count=32768 os_page_size=4096
LOCATION: pg_buffercache_numa_pages, pg_buffercache_pages.c:385
DEBUG: 00000: NUMA: page-faulting the buffercache for proper NUMA readouts
LOCATION: pg_buffercache_numa_pages, pg_buffercache_pages.c:444
Time: 24.547 ms
20:47 myon@postgres =#

True. If it fails on the first call but succeeds on later ones, then
the problem is likely somewhere else. But also, on the second call we
won't do the memory touching. Can you try setting firstNumaTouch=false,
so that we do this on every call?

At the beginning you mentioned this is happening on i386, armel and
armhf - are all those in qemu? I've tried on my rpi5 (with 32-bit user
space), and there everything seems to work fine. But that's aarch64
kernel; just the user space is 32-bit.

regards

--
Tomas Vondra

#14 Christoph Berg
myon@debian.org
In reply to: Tomas Vondra (#13)

Re: Tomas Vondra

True. If it fails on first call, but succeeds on the other, then the
problem is likely somewhere else. But also on the second call we won't
do the memory touching. Can you try setting firstNumaTouch=false, so
that we do this on every call?

firstNumaTouch=false, it still fails on the first call.

I assume you meant actually keeping firstNumaTouch=true - but it still
fails on the first call.

The memory touching is done for the first call in each backend, but
reconnecting doesn't reset it; I have to restart PG.

At the beginning you mentioned this is happening on i386, armel and
armhf - are all those in qemu? I've tried on my rpi5 (with 32-bit user
space), and there everything seems to work fine. But that's aarch64
kernel; just the user space is 32-bit.

I'm testing on i386 in a chroot on an amd64 kernel. (same for x32)
armel and armhf are also 32-bit chroots on an arm64 host.

https://buildd.debian.org/status/package.php?p=postgresql-18&suite=experimental

Maybe this is a kernel bug.

Christoph

#15 Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Christoph Berg (#14)

On 6/23/25 23:25, Christoph Berg wrote:

Re: Tomas Vondra

True. If it fails on first call, but succeeds on the other, then the
problem is likely somewhere else. But also on the second call we won't
do the memory touching. Can you try setting firstNumaTouch=false, so
that we do this on every call?

firstNumaTouch=false, it still fails on the first call.

I assume you meant actually keeping firstNumaTouch=true - but it still
fails on the first call.

No, I meant firstNumaTouch=false, so that the touching happens on every
call. I was wondering if that makes all calls fail.

The memory touching is done for the first call in each backend, but
reconnecting doesn't reset it, I have to restart PG.

I don't follow. Why wouldn't reconnecting reset it?

At the beginning you mentioned this is happening on i386, armel and
armhf - are all those in qemu? I've tried on my rpi5 (with 32-bit user
space), and there everything seems to work fine. But that's aarch64
kernel; just the user space is 32-bit.

I'm testing on i386 in a chroot on an amd64 kernel. (same for x32)
armel and armhf are also 32-bit chroots on an arm64 host.

https://buildd.debian.org/status/package.php?p=postgresql-18&suite=experimental

Maybe this is a kernel bug.

Or maybe the 32-bit chroot on 64-bit host matters and confuses some
calculation.

--
Tomas Vondra

#16 Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Tomas Vondra (#15)

On 6/23/25 23:47, Tomas Vondra wrote:

...

Or maybe the 32-bit chroot on 64-bit host matters and confuses some
calculation.

I think it's likely something like this. I noticed that if I modify
pg_buffercache_numa_pages() to query the addresses one by one, it works.
And when I increase the number, it stops working somewhere between 16k
and 17k items.

It may be a coincidence, but I suspect it's related to sizeof(void *)
being 8 in the kernel, but only 4 in the chroot. So the userspace
passes an array of 4-byte items, but the kernel interprets that as
8-byte items. That is, we call

long move_pages(int pid, unsigned long count, void *pages[.count], const
int nodes[.count], int status[.count], int flags);

Which (I assume) just passes the parameters to kernel. And it'll
interpret them per kernel pointer size.

If this is what's happening, I'm not sure what to do about it ...

FWIW while looking into this, I tried running this under valgrind (on a
regular 64-bit system, not in the chroot), and I get this report:

==65065== Invalid read of size 8
==65065== at 0x113B0EBE: pg_buffercache_numa_pages
(pg_buffercache_pages.c:380)
==65065== by 0x6B539D: ExecMakeTableFunctionResult (execSRF.c:234)
==65065== by 0x6CEB7E: FunctionNext (nodeFunctionscan.c:94)
==65065== by 0x6B6ACA: ExecScanFetch (execScan.h:126)
==65065== by 0x6B6B31: ExecScanExtended (execScan.h:170)
==65065== by 0x6B6C9D: ExecScan (execScan.c:59)
==65065== by 0x6CEF0F: ExecFunctionScan (nodeFunctionscan.c:269)
==65065== by 0x6B29FA: ExecProcNodeFirst (execProcnode.c:469)
==65065== by 0x6A6F56: ExecProcNode (executor.h:313)
==65065== by 0x6A9533: ExecutePlan (execMain.c:1679)
==65065== by 0x6A7422: standard_ExecutorRun (execMain.c:367)
==65065== by 0x6A7330: ExecutorRun (execMain.c:304)
==65065== by 0x934EF0: PortalRunSelect (pquery.c:921)
==65065== by 0x934BD8: PortalRun (pquery.c:765)
==65065== by 0x92E4CD: exec_simple_query (postgres.c:1273)
==65065== by 0x93301E: PostgresMain (postgres.c:4766)
==65065== by 0x92A88B: BackendMain (backend_startup.c:124)
==65065== by 0x85A7C7: postmaster_child_launch (launch_backend.c:290)
==65065== by 0x860111: BackendStartup (postmaster.c:3580)
==65065== by 0x85DE6F: ServerLoop (postmaster.c:1702)
==65065== Address 0x7b6c000 is in a rw- anonymous segment

This fails here (on the pg_numa_touch_mem_if_required call):

for (char *ptr = startptr; ptr < endptr; ptr += os_page_size)
{
    os_page_ptrs[idx++] = ptr;

    /* Only need to touch memory once per backend process */
    if (firstNumaTouch)
        pg_numa_touch_mem_if_required(touch, ptr);
}

The 0x7b6c000 is the very first pointer, and it's the only pointer that
triggers this warning. At first I thought there's something wrong with
how we align the pointer using TYPEALIGN_DOWN(), but then I noticed it's
actually the pointer of BufferGetBlock(1).

So I'm a bit puzzled by this, and I'm not sure it's related to the other
issue at all (it probably is not).

It's a bit too late here, I'll continue investigating this tomorrow.

--
Tomas Vondra

#17Bertrand Drouvot
bertranddrouvot.pg@gmail.com
In reply to: Tomas Vondra (#16)

Hi,

On Tue, Jun 24, 2025 at 03:43:19AM +0200, Tomas Vondra wrote:

On 6/23/25 23:47, Tomas Vondra wrote:

...

Or maybe the 32-bit chroot on 64-bit host matters and confuses some
calculation.

I think it's likely something like this.

I think the same.

I noticed that if I modify
pg_buffercache_numa_pages() to query the addresses one by one, it works.
And when I increase the number, it stops working somewhere between 16k
and 17k items.

Yeah, same for me with pg_get_shmem_allocations_numa(). It works if
pg_numa_query_pages() is done on chunks <= 16 pages but fails if done on more
than 16 pages.

It's also confirmed by test_chunk_size.c attached:

$ gcc-11 -m32 -o test_chunk_size test_chunk_size.c
$ ./test_chunk_size
1 pages: SUCCESS (0 errors)
2 pages: SUCCESS (0 errors)
3 pages: SUCCESS (0 errors)
4 pages: SUCCESS (0 errors)
5 pages: SUCCESS (0 errors)
6 pages: SUCCESS (0 errors)
7 pages: SUCCESS (0 errors)
8 pages: SUCCESS (0 errors)
9 pages: SUCCESS (0 errors)
10 pages: SUCCESS (0 errors)
11 pages: SUCCESS (0 errors)
12 pages: SUCCESS (0 errors)
13 pages: SUCCESS (0 errors)
14 pages: SUCCESS (0 errors)
15 pages: SUCCESS (0 errors)
16 pages: SUCCESS (0 errors)
17 pages: 1 errors
Threshold: 17 pages

No error if -m32 is not used.

It may be a coincidence, but I suspect it's related to sizeof(void *)
being 8 in the kernel, but only 4 in the chroot. So the userspace
passes an array of 4-byte items, but the kernel interprets that as
8-byte items. That is, we call

long move_pages(int pid, unsigned long count, void *pages[.count],
                const int nodes[.count], int status[.count], int flags);

Which (I assume) just passes the parameters to the kernel. And it'll
interpret them per kernel pointer size.

I also suspect something in this area...

If this is what's happening, I'm not sure what to do about it ...

We could work in chunks (of 16?) on 32-bit, but that would probably cause a
performance degradation (we do mention that possibility in the docs, though).
Also, would 16 always be a correct chunk size?

Regards,

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

Attachments:

test_chunk_size.c (text/x-csrc; charset=us-ascii)
#18Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Bertrand Drouvot (#17)

On 6/24/25 10:24, Bertrand Drouvot wrote:

Hi,

On Tue, Jun 24, 2025 at 03:43:19AM +0200, Tomas Vondra wrote:

On 6/23/25 23:47, Tomas Vondra wrote:

...

Or maybe the 32-bit chroot on 64-bit host matters and confuses some
calculation.

I think it's likely something like this.

I think the same.

I noticed that if I modify
pg_buffercache_numa_pages() to query the addresses one by one, it works.
And when I increase the number, it stops working somewhere between 16k
and 17k items.

Yeah, same for me with pg_get_shmem_allocations_numa(). It works if
pg_numa_query_pages() is done on chunks <= 16 pages but fails if done on more
than 16 pages.

It's also confirmed by test_chunk_size.c attached:

$ gcc-11 -m32 -o test_chunk_size test_chunk_size.c
$ ./test_chunk_size
1 pages: SUCCESS (0 errors)
2 pages: SUCCESS (0 errors)
3 pages: SUCCESS (0 errors)
4 pages: SUCCESS (0 errors)
5 pages: SUCCESS (0 errors)
6 pages: SUCCESS (0 errors)
7 pages: SUCCESS (0 errors)
8 pages: SUCCESS (0 errors)
9 pages: SUCCESS (0 errors)
10 pages: SUCCESS (0 errors)
11 pages: SUCCESS (0 errors)
12 pages: SUCCESS (0 errors)
13 pages: SUCCESS (0 errors)
14 pages: SUCCESS (0 errors)
15 pages: SUCCESS (0 errors)
16 pages: SUCCESS (0 errors)
17 pages: 1 errors
Threshold: 17 pages

No error if -m32 is not used.

It may be a coincidence, but I suspect it's related to sizeof(void *)
being 8 in the kernel, but only 4 in the chroot. So the userspace
passes an array of 4-byte items, but the kernel interprets that as
8-byte items. That is, we call

long move_pages(int pid, unsigned long count, void *pages[.count],
                const int nodes[.count], int status[.count], int flags);

Which (I assume) just passes the parameters to the kernel. And it'll
interpret them per kernel pointer size.

I also suspect something in this area...

If this is what's happening, I'm not sure what to do about it ...

We could work in chunks (of 16?) on 32-bit, but that would probably cause a
performance degradation (we do mention that possibility in the docs, though).
Also, would 16 always be a correct chunk size?

I don't see how this would solve anything?

AFAICS the problem is that the two places are confused about how large the
array elements are, and interpret that differently. Using a
smaller array won't solve that. The pg function would still allocate an
array of 16 x 32-bit pointers, and the kernel would interpret this as 16
x 64-bit pointers. And that means the kernel will (a) write into memory
beyond the allocated buffer - a clear buffer overflow, and (b) see bogus
pointers, because it'll concatenate two 32-bit pointers.

I don't see how using a smaller array makes this correct. That it works is
more a matter of luck, and also a consequence of still allocating the
whole array, so there's no overflow (at least I kept that, not sure how
you did the chunks).

If I fix the code to make the entries 64-bit (by treating the pointers
as int64), it suddenly starts working - no bad addresses, etc. Well,
almost, because I get this

 bufferid | os_page_num | numa_node
----------+-------------+-----------
        1 |           0 |         0
        1 |           1 |       -14
        2 |           2 |         0
        2 |           3 |       -14
        3 |           4 |         0
        3 |           5 |       -14
        4 |           6 |         0
        4 |           7 |       -14
...

The -14 status is interesting, because that's the same value Christoph
reported as the other issue (in pg_shmem_allocations_numa).

I did an experiment and changed os_page_status to be declared as int64,
not just int. And interestingly, that produced this:

 bufferid | os_page_num | numa_node
----------+-------------+-----------
        1 |           0 |         0
        1 |           1 |         0
        2 |           2 |         0
        2 |           3 |         0
        3 |           4 |         0
        3 |           5 |         0
        4 |           6 |         0
        4 |           7 |         0
...

But I don't see how this makes any sense, because "int" should be 4B in
both cases (in 64-bit kernel and 32-bit chroot).

FWIW I realized this applies to "official" systems with 32-bit user
space on 64-bit kernels, like e.g. rpi5 with RPi OS 32-bit. (Fun fact,
rpi5 has 8 NUMA nodes, with all CPUs attached to all NUMA nodes.)

I'm starting to think we need to disable NUMA for setups like this,
mixing 64-bit kernels with 32-bit chroot. Is there a good way to detect
those, so that we can error-out?

FWIW this doesn't explain the strange valgrind issue, though.

--
Tomas Vondra

#19Bertrand Drouvot
bertranddrouvot.pg@gmail.com
In reply to: Tomas Vondra (#18)

Hi,

On Tue, Jun 24, 2025 at 11:20:15AM +0200, Tomas Vondra wrote:

On 6/24/25 10:24, Bertrand Drouvot wrote:

Yeah, same for me with pg_get_shmem_allocations_numa(). It works if
pg_numa_query_pages() is done on chunks <= 16 pages but fails if done on more
than 16 pages.

It's also confirmed by test_chunk_size.c attached:

$ gcc-11 -m32 -o test_chunk_size test_chunk_size.c
$ ./test_chunk_size
1 pages: SUCCESS (0 errors)
2 pages: SUCCESS (0 errors)
3 pages: SUCCESS (0 errors)
4 pages: SUCCESS (0 errors)
5 pages: SUCCESS (0 errors)
6 pages: SUCCESS (0 errors)
7 pages: SUCCESS (0 errors)
8 pages: SUCCESS (0 errors)
9 pages: SUCCESS (0 errors)
10 pages: SUCCESS (0 errors)
11 pages: SUCCESS (0 errors)
12 pages: SUCCESS (0 errors)
13 pages: SUCCESS (0 errors)
14 pages: SUCCESS (0 errors)
15 pages: SUCCESS (0 errors)
16 pages: SUCCESS (0 errors)
17 pages: 1 errors
Threshold: 17 pages

No error if -m32 is not used.

We could work by chunks (16?) on 32 bits but would probably produce performance
degradation (we mention it in the doc though). Also would always 16 be a correct
chunk size?

I don't see how this would solve anything?

AFAICS the problem is the two places are confused about how large the
array elements are, and get to interpret that differently.

I don't see how using smaller array makes this correct. That it works is
more a matter of luck,

Not sure it's luck, maybe the wrong pointer arithmetic has no effect if the
batch size is <= 16.

So we have kernel_move_pages() -> do_pages_stat() (because nodes is NULL here
for us, as we call "numa_move_pages(pid, count, pages, NULL, status, 0);").

So, if we look at do_pages_stat() ([1]https://github.com/torvalds/linux/blob/master/mm/migrate.c), we can see that it uses a hardcoded
"#define DO_PAGES_STAT_CHUNK_NR 16UL" and that this pointer arithmetic:

"
pages += chunk_nr;
status += chunk_nr;
"

is done but has no effect, since nr_pages makes us exit the loop if we use a
batch size <= 16.

So if this pointer arithmetic is not correct (it seems it should advance
by 16 * sizeof(compat_uptr_t) instead), then it has no effect as long as the
batch size is <= 16.

Does test_chunk_size.c also fail at 17 for you?

[1]: https://github.com/torvalds/linux/blob/master/mm/migrate.c

Regards,

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

#20Andres Freund
andres@anarazel.de
In reply to: Tomas Vondra (#16)

Hi,

On 2025-06-24 03:43:19 +0200, Tomas Vondra wrote:

FWIW while looking into this, I tried running this under valgrind (on a
regular 64-bit system, not in the chroot), and I get this report:

==65065== Invalid read of size 8
==65065== at 0x113B0EBE: pg_buffercache_numa_pages
(pg_buffercache_pages.c:380)
==65065== by 0x6B539D: ExecMakeTableFunctionResult (execSRF.c:234)
==65065== by 0x6CEB7E: FunctionNext (nodeFunctionscan.c:94)
==65065== by 0x6B6ACA: ExecScanFetch (execScan.h:126)
==65065== by 0x6B6B31: ExecScanExtended (execScan.h:170)
==65065== by 0x6B6C9D: ExecScan (execScan.c:59)
==65065== by 0x6CEF0F: ExecFunctionScan (nodeFunctionscan.c:269)
==65065== by 0x6B29FA: ExecProcNodeFirst (execProcnode.c:469)
==65065== by 0x6A6F56: ExecProcNode (executor.h:313)
==65065== by 0x6A9533: ExecutePlan (execMain.c:1679)
==65065== by 0x6A7422: standard_ExecutorRun (execMain.c:367)
==65065== by 0x6A7330: ExecutorRun (execMain.c:304)
==65065== by 0x934EF0: PortalRunSelect (pquery.c:921)
==65065== by 0x934BD8: PortalRun (pquery.c:765)
==65065== by 0x92E4CD: exec_simple_query (postgres.c:1273)
==65065== by 0x93301E: PostgresMain (postgres.c:4766)
==65065== by 0x92A88B: BackendMain (backend_startup.c:124)
==65065== by 0x85A7C7: postmaster_child_launch (launch_backend.c:290)
==65065== by 0x860111: BackendStartup (postmaster.c:3580)
==65065== by 0x85DE6F: ServerLoop (postmaster.c:1702)
==65065== Address 0x7b6c000 is in a rw- anonymous segment

This fails here (on the pg_numa_touch_mem_if_required call):

for (char *ptr = startptr; ptr < endptr; ptr += os_page_size)
{
    os_page_ptrs[idx++] = ptr;

    /* Only need to touch memory once per backend process */
    if (firstNumaTouch)
        pg_numa_touch_mem_if_required(touch, ptr);
}

That's because we mark unpinned pages as inaccessible / mark them as
accessible when pinning. See logic related to that in PinBuffer():

/*
 * Assume that we acquired a buffer pin for the purposes of
 * Valgrind buffer client checks (even in !result case) to
 * keep things simple. Buffers that are unsafe to access are
 * not generally guaranteed to be marked undefined or
 * non-accessible in any case.
 */

The 0x7b6c000 is the very first pointer, and it's the only pointer that
triggers this warning.

I suspect that that's because valgrind combines different reports or such.

Greetings,

Andres Freund

#21Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Bertrand Drouvot (#19)
#22Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Andres Freund (#20)
#23Bertrand Drouvot
bertranddrouvot.pg@gmail.com
In reply to: Tomas Vondra (#21)
#24Christoph Berg
myon@debian.org
In reply to: Bertrand Drouvot (#23)
#25Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Christoph Berg (#24)
#26Christoph Berg
myon@debian.org
In reply to: Tomas Vondra (#25)
#27Christoph Berg
myon@debian.org
In reply to: Bertrand Drouvot (#23)
#28Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Christoph Berg (#26)
#29Bertrand Drouvot
bertranddrouvot.pg@gmail.com
In reply to: Christoph Berg (#24)
#30Bertrand Drouvot
bertranddrouvot.pg@gmail.com
In reply to: Christoph Berg (#26)
#31Bertrand Drouvot
bertranddrouvot.pg@gmail.com
In reply to: Tomas Vondra (#28)
#32Jakub Wartak
jakub.wartak@enterprisedb.com
In reply to: Christoph Berg (#26)
#33Christoph Berg
myon@debian.org
In reply to: Jakub Wartak (#32)
#34Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Christoph Berg (#33)
#35Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Jakub Wartak (#32)
#36Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Tomas Vondra (#35)
#37Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Alvaro Herrera (#36)
#38Bertrand Drouvot
bertranddrouvot.pg@gmail.com
In reply to: Christoph Berg (#33)
#39Bertrand Drouvot
bertranddrouvot.pg@gmail.com
In reply to: Tomas Vondra (#28)
#40Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Bertrand Drouvot (#39)
#41Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Bertrand Drouvot (#38)
#42Bertrand Drouvot
bertranddrouvot.pg@gmail.com
In reply to: Tomas Vondra (#41)
#43Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Bertrand Drouvot (#42)
#44Bertrand Drouvot
bertranddrouvot.pg@gmail.com
In reply to: Tomas Vondra (#43)
#45Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Bertrand Drouvot (#44)
#46Christoph Berg
myon@debian.org
In reply to: Tomas Vondra (#40)
#47Bertrand Drouvot
bertranddrouvot.pg@gmail.com
In reply to: Christoph Berg (#46)
#48Christoph Berg
myon@debian.org
In reply to: Tomas Vondra (#45)
#49Christoph Berg
myon@debian.org
In reply to: Christoph Berg (#48)
#50Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Tomas Vondra (#45)
#51Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Heikki Linnakangas (#50)
#52Bertrand Drouvot
bertranddrouvot.pg@gmail.com
In reply to: Alvaro Herrera (#51)
#53Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Bertrand Drouvot (#52)