Draft for basic NUMA observability

bertranddrouvot.pg@gmail.com

about 1 year ago

In reply to: Jakub Wartak (#1)

Re: Draft for basic NUMA observability

Hi,

On Fri, Feb 07, 2025 at 03:32:43PM +0100, Jakub Wartak wrote:

As I have promised to Andres on the Discord hacking server some time
ago, I'm attaching the very brief (and potentially way too rushed)
draft of the first step into NUMA observability on PostgreSQL that was
based on his presentation [0]. It might be rough, but it is to get us
started. The patches were not really even basically tested, they are
more like input for discussion - rather than solid code - to shake out
what should be the proper form of this.

Right now it gives:

postgres=# select numa_zone_id, count(*) from pg_buffercache group by
numa_zone_id;
NOTICE: os_page_count=32768 os_page_size=4096 pages_per_blk=2.000000
numa_zone_id | count
--------------+-------
| 16127
6 | 256
1 | 1

Thanks for the patch!

Not doing a code review but sharing some experimentation.

First, I had to:

@@ -99,7 +100,7 @@ pg_buffercache_pages(PG_FUNCTION_ARGS)
                Size            os_page_size;
                void            **os_page_ptrs;
                int                     *os_pages_status;
-               int                     os_page_count;
+               uint64          os_page_count;

and

-               os_page_count = (NBuffers * BLCKSZ) / os_page_size;
+               os_page_count = ((uint64)NBuffers * BLCKSZ) / os_page_size;

to make it work with non tiny shared_buffers.

Observations:

when using 2 sessions:

Session 1 first loads buffers (e.g., by querying a relation) and then runs
'select numa_zone_id, count(*) from pg_buffercache group by numa_zone_id;'

Session 2 does nothing but runs 'select numa_zone_id, count(*) from pg_buffercache group by numa_zone_id;'

I see a lot of '-2' for the numa_zone_id in session 2, indicating that pages appear
as unmapped when viewed from a process that hasn't accessed them, even though
those same pages appear as allocated on a NUMA node in session 1.

To double check, I created a function pg_buffercache_pages_from_pid() that is
exactly the same as pg_buffercache_pages() (with your patch) except that it
takes a pid as input and uses it in move_pages(<pid>, …).

Let me show the results:

In session 1 (that "accessed/loaded" the ~65K buffers):

postgres=# select numa_zone_id, count(*) from pg_buffercache group by
numa_zone_id;
NOTICE: os_page_count=10485760 os_page_size=4096 pages_per_blk=2.000000
numa_zone_id | count
--------------+---------
| 5177310
0 | 65192
-2 | 378
(3 rows)

postgres=# select pg_backend_pid();
pg_backend_pid
----------------
1662580

In session 2:

^
postgres=# select numa_zone_id, count(*) from pg_buffercache_pages_from_pid(pg_backend_pid()) group by numa_zone_id;
NOTICE: os_page_count=10485760 os_page_size=4096 pages_per_blk=2.000000
numa_zone_id | count
--------------+---------
| 5177301
0 | 90
-2 | 65489
(3 rows)

But when session's 1 pid is used:

postgres=# select numa_zone_id, count(*) from pg_buffercache_pages_from_pid(1662580) group by numa_zone_id;
NOTICE: os_page_count=10485760 os_page_size=4096 pages_per_blk=2.000000
numa_zone_id | count
--------------+---------
| 5177301
0 | 65195
-2 | 384
(3 rows)

Results show:

Correct NUMA distribution in session 1
Correct NUMA distribution in session 2 only when using pg_buffercache_pages_from_pid()
with the pid of session 1 as a parameter (the session that actually accessed the buffers)

Which makes me wondering if using numa_move_pages()/move_pages is the
right approach. Would be curious to know if you observe the same behavior though.

The initial idea that you shared on discord was to use get_mempolicy() but
as Andres stated:

"
One annoying thing about get_mempolicy() is this:

If no page has yet been allocated for the specified address, get_mempolicy() will allocate a page as if the thread
had performed a read (load) access to that address, and return the ID of the node where that page was allocated.

Forcing the allocation to happen inside a monitoring function is decidedly not great.
"

The man page looks correct (verified with "perf record -e page-faults,kmem:mm_page_alloc -p <pid>")
while using get_mempolicy().

But maybe we could use get_mempolicy() only on "valid" buffers i.e
((buf_state & BM_VALID) && (buf_state & BM_TAG_VALID)), thoughts?

Regards,

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

jakub.wartak@enterprisedb.com

about 1 year ago

In reply to: Bertrand Drouvot (#2)

Re: Draft for basic NUMA observability

On Thu, Feb 13, 2025 at 4:28 PM Bertrand Drouvot
<bertranddrouvot.pg@gmail.com> wrote:

Hi Bertrand,

Thanks for playing with this!

Which makes me wonder if using numa_move_pages()/move_pages is the right approach. Would be curious to know if you observe the same behavior though.

You are correct, I'm observing identical behaviour, please see attached.

Forcing the allocation to happen inside a monitoring function is decidedly not great.

We probably would need to split it to some separate and new view
within the pg_buffercache extension, but that is going to be slow, yet
still provide valid results. In the previous approach that
get_mempolicy() was allocating on 1st access, but it was slow not only
because it was allocating but also because it was just 1 syscall per
1x addr (yikes!). I somehow struggle to imagine how e.g. scanning
(really allocating) a 128GB buffer cache in future won't cause issues
- that's like 16-17mln (* 2) syscalls to be issued when not using
move_pages(2)

Another thing is that numa_maps(5) won't help us a lot too (not enough
granularity).

But maybe we could use get_mempolicy() only on "valid" buffers i.e ((buf_state & BM_VALID) && (buf_state & BM_TAG_VALID)), thoughts?

Different perspective: I wanted to use the same approach in the new
pg_shmemallocations_numa, but that won't cut it there. The other idea
that came to my mind is to issue move_pages() from the backend that
has already used all of those pages. That literally mean on of the
below ideas:
1. from somewhere like checkpointer / bgwriter?
2. add touching memory on backend startup like always (sic!)
3. or just attempt to read/touch memory addr just before calling
move_pages(). E.g. this last options is just two lines:

if(os_page_ptrs[blk2page+j] == 0) {
+    volatile uint64 touch pg_attribute_unused();
    os_page_ptrs[blk2page+j] = (char *)BufHdrGetBlock(bufHdr) +
(os_page_size*j);
+    touch = *(uint64 *)os_page_ptrs[blk2page+j];
}

and it seems to work while still issuing much less syscalls with
move_pages() across backends, well at least here.

Frankly speaking I do not know which path to take with this, maybe
that's good enough?

-J.

bertranddrouvot.pg@gmail.com

about 1 year ago

In reply to: Jakub Wartak (#3)

Re: Draft for basic NUMA observability

Hi Jakub,

On Mon, Feb 17, 2025 at 01:02:04PM +0100, Jakub Wartak wrote:

On Thu, Feb 13, 2025 at 4:28 PM Bertrand Drouvot
<bertranddrouvot.pg@gmail.com> wrote:

Hi Bertrand,

Thanks for playing with this!

Which makes me wonder if using numa_move_pages()/move_pages is the right approach. Would be curious to know if you observe the same behavior though.

You are correct, I'm observing identical behaviour, please see attached.

Thanks for confirming!

We probably would need to split it to some separate and new view
within the pg_buffercache extension, but that is going to be slow, yet
still provide valid results.

Yup.

In the previous approach that
get_mempolicy() was allocating on 1st access, but it was slow not only
because it was allocating but also because it was just 1 syscall per
1x addr (yikes!). I somehow struggle to imagine how e.g. scanning
(really allocating) a 128GB buffer cache in future won't cause issues
- that's like 16-17mln (* 2) syscalls to be issued when not using
move_pages(2)

Yeah, get_mempolicy() not working on a range is not great.

But maybe we could use get_mempolicy() only on "valid" buffers i.e ((buf_state & BM_VALID) && (buf_state & BM_TAG_VALID)), thoughts?

Different perspective: I wanted to use the same approach in the new
pg_shmemallocations_numa, but that won't cut it there. The other idea
that came to my mind is to issue move_pages() from the backend that
has already used all of those pages. That literally mean on of the
below ideas:
1. from somewhere like checkpointer / bgwriter?
2. add touching memory on backend startup like always (sic!)
3. or just attempt to read/touch memory addr just before calling
move_pages(). E.g. this last options is just two lines:
if(os_page_ptrs[blk2page+j] == 0) {
+    volatile uint64 touch pg_attribute_unused();
os_page_ptrs[blk2page+j] = (char *)BufHdrGetBlock(bufHdr) +
(os_page_size*j);
+    touch = *(uint64 *)os_page_ptrs[blk2page+j];
}
and it seems to work while still issuing much less syscalls with
move_pages() across backends, well at least here.

One of the main issue I see with 1. and 2. is that we would not get accurate
results should the kernel decides to migrate the pages. Indeed, the process doing
the move_pages() call needs to have accessed the pages more recently than any
kernel migrations to see accurate locations.

OTOH, one of the main issue that I see with 3. is that the monitoring could
probably influence the kernel's decision to start pages migration (I'm not 100%
sure but I could imagine it may influence the kernel's decision due to having to
read/touch the pages).

But I'm thinking: do we really need to know the page location of every single page?
I think what we want to see is if the pages are "equally" distributed on all
the nodes or are somehow "stuck" to one (or more) nodes. In that case what about
using get_mempolicy() but on a subset of the buffer cache? (say every Nth buffer
or contiguous chunks). We could create a new function that would accept a
"sampling distance" as parameter for example, thoughts?

Regards,

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

jakub.wartak@enterprisedb.com

about 1 year ago

In reply to: Bertrand Drouvot (#4)

Re: Draft for basic NUMA observability

Hi Bertrand,

TL;DR; the main problem seems choosing which way to page-fault the
shared memory before the backend is going to use numa_move_pages() as
the memory mappings (fresh after fork()/CoW) seem to be not ready for
numa_move_pages() inquiry.

On Thu, Feb 20, 2025 at 9:32 AM Bertrand Drouvot
<bertranddrouvot.pg@gmail.com> wrote:

We probably would need to split it to some separate and new view
within the pg_buffercache extension, but that is going to be slow, yet
still provide valid results.

Yup.

OK so I've made that NUMA inquiry (now with that "volatile touch" to
get valid results for not used memory) into a new and separate
pg_buffercache_numa view. This avoids the problem that somebody would
automatically run into this slow path when using pg_buffercache.

But maybe we could use get_mempolicy() only on "valid" buffers i.e ((buf_state & BM_VALID) && (buf_state & BM_TAG_VALID)), thoughts?

Different perspective: I wanted to use the same approach in the new
pg_shmemallocations_numa, but that won't cut it there. The other idea
that came to my mind is to issue move_pages() from the backend that
has already used all of those pages. That literally mean on of the
below ideas:
1. from somewhere like checkpointer / bgwriter?
2. add touching memory on backend startup like always (sic!)
3. or just attempt to read/touch memory addr just before calling
move_pages(). E.g. this last options is just two lines:
if(os_page_ptrs[blk2page+j] == 0) {
+    volatile uint64 touch pg_attribute_unused();
os_page_ptrs[blk2page+j] = (char *)BufHdrGetBlock(bufHdr) +
(os_page_size*j);
+    touch = *(uint64 *)os_page_ptrs[blk2page+j];
}
and it seems to work while still issuing much less syscalls with
move_pages() across backends, well at least here.
One of the main issue I see with 1. and 2. is that we would not get accurate
results should the kernel decides to migrate the pages. Indeed, the process doing
the move_pages() call needs to have accessed the pages more recently than any
kernel migrations to see accurate locations.

We never get fully accurate state as the zone memory migration might
be happening as we query it, but in theory we could add something to
e.g. checkpointer/bgwriter that would inquiry it on demand and report
it back somewhat through shared memory (?), but I'm somehow afraid
because as stated at the end of email, it might take some time (well
we probably wouldn't need to "touch memory" then after all, as all of
it is active), but that's still impact to those bgworkers. Somehow I
feel safer if that code is NOT part of bgworker.

OTOH, one of the main issue that I see with 3. is that the monitoring could
probably influence the kernel's decision to start pages migration (I'm not 100%
sure but I could imagine it may influence the kernel's decision due to having to
read/touch the pages).

But I'm thinking: do we really need to know the page location of every single page?
I think what we want to see is if the pages are "equally" distributed on all
the nodes or are somehow "stuck" to one (or more) nodes. In that case what about
using get_mempolicy() but on a subset of the buffer cache? (say every Nth buffer
or contiguous chunks). We could create a new function that would accept a
"sampling distance" as parameter for example, thoughts?

The way I envision it (and I think what Andres wanted, not sure, still
yet to see him comment on all of this) is to give PG devs a way to
quickly spot NUMA imbalances, even for single relation. Probably some
DBA in the wild could also query it to see how PG/kernel distributes
memory from time to time. It seems to be more debugging and coding aid
for future NUMA optimizations, rather than being used by some
monitoring constantly. I would even dare to say it would require
--enable-debug (or some other developer-only toggle), but apparently
there's no need to hide it like that if those are separate views.

Changes since previous version:
0. rebase due the recent OAuth commit introducing libcurl
1. cast uint64 for NBuffers as You found out
2. put stuff into pg_buffercache_numa
3. 0003 adds pg_shmem_numa_allocations Or should we rather call it
pg_shmem_numa_zones or maybe just pg_shm_numa ?

If there would be agreement that this is the way we want to have it
(from the backend and not from checkpointer), here's what's left on
the table to be done here:
a. isn't there something quicker for touching / page-faulting memory ?
If not then maybe add CHECKS_FOR_INTERRUPTS() there? BTW I've tried
additional MAP_POPULATE for PG_MMAP_FLAGS, but that didn't help (it
probably only works for parent//postmaster). I've also tried
MADV_POPULATE_READ (5.14+ kernels only) and that seems to work too:

+       rc = madvise(ShmemBase, ShmemSegHdr->totalsize, MADV_POPULATE_READ);
+       if(rc != 0) {
+               elog(NOTICE, "madvice() failed");
+       }
[..]
-                       volatile uint64 touch pg_attribute_unused();
                        os_page_ptrs[i] = (char *)ent->location + (i *
os_page_size);
-                       touch = *(uint64 *)os_page_ptrs[i];

with volatile touching memory or MADV_POPULATE_READ the result seems
to reliable (s_b 128MB here):

postgres@postgres:1234 : 14442 # select * from
pg_shmem_numa_allocations order by numa_size desc;
name | numa_zone_id | numa_size
------------------------------------------------+--------------+-----------
Buffer Blocks | 0 | 134221824
XLOG Ctl | 0 | 4206592
Buffer Descriptors | 0 | 1048576
transaction | 0 | 528384
Checkpointer Data | 0 | 524288
Checkpoint BufferIds | 0 | 327680
Shared Memory Stats | 0 | 311296
[..]

without at least one of those two, new backend reports complete garbage:

name | numa_zone_id | numa_size
------------------------------------------------+--------------+-----------
Buffer Blocks | 0 | 995328
Shared Memory Stats | 0 | 245760
shmInvalBuffer | 0 | 65536
Buffer Descriptors | 0 | 65536
Backend Status Array | 0 | 61440
serializable | 0 | 57344
[..]

b. refactor shared code so that it goes into src/port (but with
Linux-only support so far)
c. should we use MemoryContext in pg_get_shmem_numa_allocations or not?
d. fix tests, indent it, docs, make cfbot happy

As for the sampling, dunno, fine for me. As an optional argument? but
wouldn't it be better to find a way to actually for it to be quick?

OK, so here's larger test, on 512GB with 8x NUMA nodes and s_b set to
128GB with numactl --interleave=all pg_ctl start:

postgres=# select * from pg_shmem_numa_allocations ;
name | numa_zone_id | numa_size
------------------------------------------------+--------------+-------------
[..]
Buffer Blocks | 0 | 17179869184
Buffer Blocks | 1 | 17179869184
Buffer Blocks | 2 | 17179869184
Buffer Blocks | 3 | 17179869184
Buffer Blocks | 4 | 17179869184
Buffer Blocks | 5 | 17179869184
Buffer Blocks | 6 | 17179869184
Buffer Blocks | 7 | 17179869184
Buffer IO Condition Variables | 0 | 33554432
Buffer IO Condition Variables | 1 | 33554432
Buffer IO Condition Variables | 2 | 33554432
[..]

but it takes 23s. Yes it takes 23s to just gather that info with
memory touch, but that's ~128GB of memory and is hardly responsible
(lack of C_F_I()). By default without numactl's interleave=all, you
get clear picture of lack of NUMA awareness in PG shared segment (just
as Andres presented, but now it is evident; well it is subject to
autobalancing of course):

postgres=# select * from pg_shmem_numa_allocations ;
name | numa_zone_id | numa_size
------------------------------------------------+--------------+-------------
[..]
commit_timestamp | 0 | 2097152
commit_timestamp | 1 | 6291456
commit_timestamp | 2 | 0
commit_timestamp | 3 | 0
commit_timestamp | 4 | 0
[..]
transaction | 0 | 14680064
transaction | 1 | 0
transaction | 2 | 0
transaction | 3 | 0
transaction | 4 | 2097152
[..]

Somehow without interleave it is very quick too.

-J.

andres@anarazel.de

about 1 year ago

In reply to: Jakub Wartak (#5)

Re: Draft for basic NUMA observability

Hi,

On 2025-02-24 12:57:16 +0100, Jakub Wartak wrote:

On Thu, Feb 20, 2025 at 9:32 AM Bertrand Drouvot
<bertranddrouvot.pg@gmail.com> wrote:

OTOH, one of the main issue that I see with 3. is that the monitoring could
probably influence the kernel's decision to start pages migration (I'm not 100%
sure but I could imagine it may influence the kernel's decision due to having to
read/touch the pages).

But I'm thinking: do we really need to know the page location of every single page?
I think what we want to see is if the pages are "equally" distributed on all
the nodes or are somehow "stuck" to one (or more) nodes. In that case what about
using get_mempolicy() but on a subset of the buffer cache? (say every Nth buffer
or contiguous chunks). We could create a new function that would accept a
"sampling distance" as parameter for example, thoughts?

The way I envision it (and I think what Andres wanted, not sure, still
yet to see him comment on all of this) is to give PG devs a way to
quickly spot NUMA imbalances, even for single relation.

Yea. E.g. for some benchmark workloads the difference whether the root btree
page is on the same NUMA node as the workload or not makes a roughly 2x perf
difference. It's really hard to determine that today.

If there would be agreement that this is the way we want to have it
(from the backend and not from checkpointer), here's what's left on
the table to be done here:

a. isn't there something quicker for touching / page-faulting memory ?

If you actually fault in a page the kernel actually has to allocate memory and
then zero it out. That rather severely limits the throughput...

If not then maybe add CHECKS_FOR_INTERRUPTS() there?

Should definitely be there.

BTW I've tried additional MAP_POPULATE for PG_MMAP_FLAGS, but that didn't
help (it probably only works for parent//postmaster).

Yes, needs to be in postmaster.

Does the issue with "new" backends seeing pages as not present exist both with
and without huge pages?

FWIW, what you posted fails on CI:
https://cirrus-ci.com/task/5114213770723328

Probably some ifdefs are missing. The sanity-check task configures with
minimal dependencies, which is why you're seeing this even on linux.

b. refactor shared code so that it goes into src/port (but with
Linux-only support so far)
c. should we use MemoryContext in pg_get_shmem_numa_allocations or not?

You mean a specific context instead of CurrentMemoryContext?

diff --git a/.cirrus.tasks.yml b/.cirrus.tasks.yml
index 91b51142d2e..e3b7554d9e8 100644
--- a/.cirrus.tasks.yml
+++ b/.cirrus.tasks.yml
@@ -436,6 +436,10 @@ task:
SANITIZER_FLAGS: -fsanitize=address
PG_TEST_PG_COMBINEBACKUP_MODE: --copy-file-range

+
+      # FIXME: use or not the libnuma?
+      #      --with-libnuma \
+      #
# Normally, the "relation segment" code basically has no coverage in our
# tests, because we (quite reasonably) don't generate tables large
# enough in tests. We've had plenty bugs that we didn't notice due
the

I don't see why not.

diff --git a/contrib/pg_buffercache/pg_buffercache--1.5--1.6.sql b/contrib/pg_buffercache/pg_buffercache--1.5--1.6.sql
new file mode 100644
index 00000000000..e5b3d1f7dd2
--- /dev/null
+++ b/contrib/pg_buffercache/pg_buffercache--1.5--1.6.sql
@@ -0,0 +1,30 @@
+/* contrib/pg_buffercache/pg_buffercache--1.5--1.6.sql */
+
+-- complain if script is sourced in psql, rather than via CREATE EXTENSION
+\echo Use "CREATE EXTENSION pg_buffercache" to load this file. \quit
+
+-- Register the function.
+DROP FUNCTION pg_buffercache_pages() CASCADE;

Why? I think that's going to cause problems, as the pg_buffercache view
depends on it, and user views might turn in depend on pg_buffercache. I think
CASCADE is rarely, if ever, ok to use in an extension scripot.

+CREATE OR REPLACE FUNCTION pg_buffercache_pages(boolean)
+RETURNS SETOF RECORD
+AS 'MODULE_PATHNAME', 'pg_buffercache_pages'
+LANGUAGE C PARALLEL SAFE;

+-- Create a view for convenient access.
+CREATE OR REPLACE VIEW pg_buffercache AS
+	SELECT P.* FROM pg_buffercache_pages(false) AS P
+	(bufferid integer, relfilenode oid, reltablespace oid, reldatabase oid,
+	 relforknumber int2, relblocknumber int8, isdirty bool, usagecount int2,
+	 pinning_backends int4);
+
+CREATE OR REPLACE VIEW pg_buffercache_numa AS
+	SELECT P.* FROM pg_buffercache_pages(true) AS P
+	(bufferid integer, relfilenode oid, reltablespace oid, reldatabase oid,
+	 relforknumber int2, relblocknumber int8, isdirty bool, usagecount int2,
+	 pinning_backends int4, numa_zone_id int4);
+
+-- Don't want these to be available to public.
+REVOKE ALL ON FUNCTION pg_buffercache_pages(boolean) FROM PUBLIC;
+REVOKE ALL ON pg_buffercache FROM PUBLIC;
+REVOKE ALL ON pg_buffercache_numa FROM PUBLIC;

We grant pg_monitor SELECT TO pg_buffercache, I think we should do the same
for _numa?

@@ -177,8 +228,61 @@ pg_buffercache_pages(PG_FUNCTION_ARGS)
else
fctx->record[i].isvalid = false;
+#ifdef USE_LIBNUMA
+/* FIXME: taken from bufmgr.c, maybe move to .h ? */
+#define BufHdrGetBlock(bufHdr)        ((Block) (BufferBlocks + ((Size) (bufHdr)->buf_id) * BLCKSZ))
+			blk2page = (int) i * pages_per_blk;

BufferGetBlock() is public, so I don't think BufHdrGetBlock() is needed here.

+			j = 0;
+			do {
+				/*
+				 * Many buffers can point to the same page, but we want to
+				 * query just first address.
+				 *
+				 * In order to get reliable results we also need to touch memory pages
+				 * so that inquiry about NUMA zone doesn't return -2.
+				 */
+				if(os_page_ptrs[blk2page+j] == 0) {
+					volatile uint64 touch pg_attribute_unused();
+					os_page_ptrs[blk2page+j] = (char *)BufHdrGetBlock(bufHdr) + (os_page_size*j);
+					touch = *(uint64 *)os_page_ptrs[blk2page+j];
+				}
+				j++;
+			} while(j < (int)pages_per_blk);
+#endif
+

Why is this done before we even have gotten -2 back? Even if we need it, it
seems like we ought to defer this until necessary.

+#ifdef USE_LIBNUMA
+		if(query_numa) {
+			/* According to numa(3) it is required to initialize library even if that's no-op. */
+			if(numa_available() == -1) {
+				pg_buffercache_mark_numa_invalid(fctx, NBuffers);
+				elog(NOTICE, "libnuma initialization failed, some NUMA data might be unavailable.");;
+			} else {
+				/* Amortize the number of pages we need to query about */
+				if(numa_move_pages(0, os_page_count, os_page_ptrs, NULL, os_pages_status, 0) == -1) {
+					elog(ERROR, "failed NUMA pages inquiry status");
+				}

I wonder if we ought to override numa_error() so we can display more useful
errors.

+
+	LWLockAcquire(ShmemIndexLock, LW_SHARED);

Doing multiple memory allocations while holding an lwlock is probably not a
great idea, even if the lock normally isn't contended.

+		os_page_ptrs = palloc(sizeof(void *) * os_page_count);
+		os_pages_status = palloc(sizeof(int) * os_page_count);

Why do this in very loop iteration?

Greetings,

Andres Freund

bertranddrouvot.pg@gmail.com

about 1 year ago

In reply to: Andres Freund (#6)

Re: Draft for basic NUMA observability

Hi,

On Mon, Feb 24, 2025 at 09:06:20AM -0500, Andres Freund wrote:

Does the issue with "new" backends seeing pages as not present exist both with
and without huge pages?

That's a good point and from what I can see it's correct with huge pages being
used (it means all processes see the same NUMA node assignment regardless of
access patterns).

That said, wouldn't that be too strong to impose a restriction that huge_pages
must be enabled?

Jakub, thanks for the new patch version! FWIW, I did not look closely to the
code yet (just did the minor changes already shared to have valid result with non
tiny shared buffer size). I'll look closely at the code for sure once we all agree
on the design part of it.

Regards,

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

jakub.wartak@enterprisedb.com

about 1 year ago

In reply to: Andres Freund (#6)

Re: Draft for basic NUMA observability

On Mon, Feb 24, 2025 at 3:06 PM Andres Freund <andres@anarazel.de> wrote:

On 2025-02-24 12:57:16 +0100, Jakub Wartak wrote:

Hi Andres, thanks for your review!

OK first sane version attached with new src/port/pg_numa.c boilerplate
in 0001. Fixed some bugs too, there is one remaining optimization to
be done (see that `static` question later). Docs/tests are still
missing.

QQ: I'm still wondering if we there's better way of exposing multiple
pg's shma entries pointing to the same page (think of something hot:
PROCLOCK or ProcArray), so wouldn't it make sense (in some future
thread/patch) to expose this info somehow via additional column
(pg_get_shmem_numa_allocations.shared_pages bool?) ? I'm thinking of
an easy way of showing that a potential NUMA auto balancing could lead
to TLB NUMA shootdowns (not that it is happening or counting , just
identifying it as a problem in allocation). Or that stuff doesn't make
sense as we already have pg_shm_allocations.{off|size} and we could
derive such info from it (after all it is for devs?)?

postgres@postgres:1234 : 18843 # select
name,off,off+allocated_size,allocated_size from pg_shmem_allocations
order by off;
name | off | ?column?
| allocated_size
------------------------------------------------+-----------+-----------+----------------
[..]
Proc Header | 147114112 |
147114240 | 128
Proc Array | 147274752 |
147275392 | 640
KnownAssignedXids | 147275392 |
147310848 | 35456
KnownAssignedXidsValid | 147310848 |
147319808 | 8960
Backend Status Array | 147319808 |
147381248 | 61440

postgres@postgres:1234 : 18924 # select * from
pg_shmem_numa_allocations where name IN ('Proc Header', 'Proc Array',
'KnownAssignedXids', '..') order by name;
name | numa_zone_id | numa_size
-------------------+--------------+-----------
KnownAssignedXids | 0 | 2097152
Proc Array | 0 | 2097152
Proc Header | 0 | 2097152

I.e. ProcArray ends and right afterwards KnownAssignedXids start, both
are hot , but on the same HP and NUMA node

If there would be agreement that this is the way we want to have it
(from the backend and not from checkpointer), here's what's left on
the table to be done here:

a. isn't there something quicker for touching / page-faulting memory ?

If you actually fault in a page the kernel actually has to allocate memory and
then zero it out. That rather severely limits the throughput...

OK, no comments about that madvise(MADV_POPULATE_READ), so I'm
sticking to pointers.

If not then maybe add CHECKS_FOR_INTERRUPTS() there?

Should definitely be there.

Added.

BTW I've tried additional MAP_POPULATE for PG_MMAP_FLAGS, but that didn't
help (it probably only works for parent//postmaster).

Yes, needs to be in postmaster.

Does the issue with "new" backends seeing pages as not present exist both with
and without huge pages?

Please see attached file for more verbose results, but in short it is
like below:

patch(-touchpages) hugepages=off INVALID RESULTS (-2)
patch(-touchpages) hugepages=on INVALID RESULTS (-2)
patch(touchpages) hugepages=off CORRECT RESULT
patch(touchpages) hugepages=on CORRECT RESULT
patch(-touchpages)+MAP_POPULATE hugepages=off INVALID RESULTS (-2)
patch(-touchpages)+MAP_POPULATE hugepages=on INVALID RESULTS (-2)

IMHVO, the only other thing that could work here (but still
page-faulting) is that 5.14+ madvise(MADV_POPULATE_READ). Tests are
welcome, maybe it might be kernel version dependent.

BTW: and yes you can "feel" the timing impact of
MAP_SHARED|MAP_POPULATE during startup, it seems that for our case
child backends that don't come-up with pre-faulted page attachments
across fork() apparently.

FWIW, what you posted fails on CI:
https://cirrus-ci.com/task/5114213770723328

Probably some ifdefs are missing. The sanity-check task configures with
minimal dependencies, which is why you're seeing this even on linux.

Hopefully fixed, we'll see what cfbot tells, I'm flying blind with all
of this CI stuff...

b. refactor shared code so that it goes into src/port (but with
Linux-only support so far)

Done.

c. should we use MemoryContext in pg_get_shmem_numa_allocations or not?

You mean a specific context instead of CurrentMemoryContext?

Yes, I had doubts earlier, but for now I'm going to leave it as it is.

diff --git a/.cirrus.tasks.yml b/.cirrus.tasks.yml
index 91b51142d2e..e3b7554d9e8 100644
--- a/.cirrus.tasks.yml
+++ b/.cirrus.tasks.yml
@@ -436,6 +436,10 @@ task:
SANITIZER_FLAGS: -fsanitize=address
PG_TEST_PG_COMBINEBACKUP_MODE: --copy-file-range

+
+      # FIXME: use or not the libnuma?
+      #      --with-libnuma \
+      #
# Normally, the "relation segment" code basically has no coverage in our
# tests, because we (quite reasonably) don't generate tables large
# enough in tests. We've had plenty bugs that we didn't notice due
the

I don't see why not.

Fixed.

diff --git a/contrib/pg_buffercache/pg_buffercache--1.5--1.6.sql b/contrib/pg_buffercache/pg_buffercache--1.5--1.6.sql
new file mode 100644
index 00000000000..e5b3d1f7dd2
--- /dev/null
+++ b/contrib/pg_buffercache/pg_buffercache--1.5--1.6.sql
@@ -0,0 +1,30 @@
+/* contrib/pg_buffercache/pg_buffercache--1.5--1.6.sql */
+
+-- complain if script is sourced in psql, rather than via CREATE EXTENSION
+\echo Use "CREATE EXTENSION pg_buffercache" to load this file. \quit
+
+-- Register the function.
+DROP FUNCTION pg_buffercache_pages() CASCADE;

... it's just me cutting corners :^), fixed now.

[..]

+-- Don't want these to be available to public.
+REVOKE ALL ON FUNCTION pg_buffercache_pages(boolean) FROM PUBLIC;
+REVOKE ALL ON pg_buffercache FROM PUBLIC;
+REVOKE ALL ON pg_buffercache_numa FROM PUBLIC;

We grant pg_monitor SELECT TO pg_buffercache, I think we should do the same
for _numa?

Yup, fixed.

@@ -177,8 +228,61 @@ pg_buffercache_pages(PG_FUNCTION_ARGS)
else
fctx->record[i].isvalid = false;
+#ifdef USE_LIBNUMA
+/* FIXME: taken from bufmgr.c, maybe move to .h ? */
+#define BufHdrGetBlock(bufHdr)        ((Block) (BufferBlocks + ((Size) (bufHdr)->buf_id) * BLCKSZ))
+                     blk2page = (int) i * pages_per_blk;
BufferGetBlock() is public, so I don't think BufHdrGetBlock() is needed here.

Fixed, thanks I was looking for something like this! Is that +1 in v4 good?

+                     j = 0;
+                     do {
+                             /*
+                              * Many buffers can point to the same page, but we want to
+                              * query just first address.
+                              *
+                              * In order to get reliable results we also need to touch memory pages
+                              * so that inquiry about NUMA zone doesn't return -2.
+                              */
+                             if(os_page_ptrs[blk2page+j] == 0) {
+                                     volatile uint64 touch pg_attribute_unused();
+                                     os_page_ptrs[blk2page+j] = (char *)BufHdrGetBlock(bufHdr) + (os_page_size*j);
+                                     touch = *(uint64 *)os_page_ptrs[blk2page+j];
+                             }
+                             j++;
+                     } while(j < (int)pages_per_blk);
+#endif
+

Why is this done before we even have gotten -2 back? Even if we need it, it
seems like we ought to defer this until necessary.

Not fixed yet: maybe we could even do a `static` with
`has_this_run_earlier` and just perform this work only once during the
first time?

+#ifdef USE_LIBNUMA
+             if(query_numa) {
+                     /* According to numa(3) it is required to initialize library even if that's no-op. */
+                     if(numa_available() == -1) {
+                             pg_buffercache_mark_numa_invalid(fctx, NBuffers);
+                             elog(NOTICE, "libnuma initialization failed, some NUMA data might be unavailable.");;
+                     } else {
+                             /* Amortize the number of pages we need to query about */
+                             if(numa_move_pages(0, os_page_count, os_page_ptrs, NULL, os_pages_status, 0) == -1) {
+                                     elog(ERROR, "failed NUMA pages inquiry status");
+                             }

I wonder if we ought to override numa_error() so we can display more useful
errors.

Another question without an easy answer as I never hit this error from
numa_move_pages(), one gets invalid stuff in *os_pages_status instead.
BUT!: most of our patch just uses things that cannot fail as per
libnuma usage. One way to trigger libnuma warnings is e.g. `chmod 700
/sys` (because it's hard to unmount it) and then still most of numactl
stuff works as euid != 0, but numactl --hardware gets at least
"libnuma: Warning: Cannot parse distance information in sysfs:
Permission denied" or same story with numactl -C 678 date. So unless
we start way more heavy use of libnuma (not just for observability)
there's like no point in that right now (?) Contrary to that: we can
do just do variadic elog() for that, I've put some code, but no idea
if that works fine...

[..]

Doing multiple memory allocations while holding an lwlock is probably not a
great idea, even if the lock normally isn't contended.

[..]

Why do this in very loop iteration?

Both fixed.

-J.

jakub.wartak@enterprisedb.com

about 1 year ago

In reply to: Bertrand Drouvot (#7)

Re: Draft for basic NUMA observability

On Mon, Feb 24, 2025 at 5:11 PM Bertrand Drouvot
<bertranddrouvot.pg@gmail.com> wrote:

Hi,

On Mon, Feb 24, 2025 at 09:06:20AM -0500, Andres Freund wrote:

Does the issue with "new" backends seeing pages as not present exist both with
and without huge pages?

That's a good point and from what I can see it's correct with huge pages being
used (it means all processes see the same NUMA node assignment regardless of
access patterns).

Hi Bertrand , please see that nearby thread. I've got quite the
opposite results. I need page-fault memory or I get invalid results
("-2"). What kernel version are you using ? (I've tried it on two
6.10.x series kernels , virtualized in both cases; one was EPYC [real
NUMA, but not VM so not real hardware]).

That said, wouldn't that be too strong to impose a restriction that huge_pages
must be enabled?

Jakub, thanks for the new patch version! FWIW, I did not look closely to the
code yet (just did the minor changes already shared to have valid result with non
tiny shared buffer size). I'll look closely at the code for sure once we all agree
on the design part of it.

Cool, I think we are pretty close actually, but others might have
different perspective.

-J.

#10

andres@anarazel.de

about 1 year ago

In reply to: Jakub Wartak (#8)

Re: Draft for basic NUMA observability

Hi,

On 2025-02-26 09:38:20 +0100, Jakub Wartak wrote:

FWIW, what you posted fails on CI:
https://cirrus-ci.com/task/5114213770723328

Probably some ifdefs are missing. The sanity-check task configures with
minimal dependencies, which is why you're seeing this even on linux.

Hopefully fixed, we'll see what cfbot tells, I'm flying blind with all
of this CI stuff...

FYI, you can enable CI on a github repo, to see results without posting to the
list:
https://github.com/postgres/postgres/blob/master/src/tools/ci/README

Greetings,

Andres Freund

#11

jakub.wartak@enterprisedb.com

about 1 year ago

In reply to: Andres Freund (#10)

Re: Draft for basic NUMA observability

On Wed, Feb 26, 2025 at 10:58 AM Andres Freund <andres@anarazel.de> wrote:

Hi,

On 2025-02-26 09:38:20 +0100, Jakub Wartak wrote:

FWIW, what you posted fails on CI:
https://cirrus-ci.com/task/5114213770723328

Probably some ifdefs are missing. The sanity-check task configures with
minimal dependencies, which is why you're seeing this even on linux.

Hopefully fixed, we'll see what cfbot tells, I'm flying blind with all
of this CI stuff...

FYI, you can enable CI on a github repo, to see results without posting to the
list:
https://github.com/postgres/postgres/blob/master/src/tools/ci/README

Thanks, I'll take a look into it.

Meanwhile v5 is attached with slight changes to try to make cfbot happy:
1. fixed tests and added tiny copy-cat basic tests for
pg_buffercache_numa and pg_shm_numa_allocations views
2. win32 doesn't have sysconf()

No docs yet.

-J.

#12

bertranddrouvot.pg@gmail.com

about 1 year ago

In reply to: Jakub Wartak (#11)

Re: Draft for basic NUMA observability

Hi,

On Wed, Feb 26, 2025 at 02:05:59PM +0100, Jakub Wartak wrote:

On Wed, Feb 26, 2025 at 10:58 AM Andres Freund <andres@anarazel.de> wrote:

Hi,

On 2025-02-26 09:38:20 +0100, Jakub Wartak wrote:

FWIW, what you posted fails on CI:
https://cirrus-ci.com/task/5114213770723328

Probably some ifdefs are missing. The sanity-check task configures with
minimal dependencies, which is why you're seeing this even on linux.

Hopefully fixed, we'll see what cfbot tells, I'm flying blind with all
of this CI stuff...

FYI, you can enable CI on a github repo, to see results without posting to the
list:
https://github.com/postgres/postgres/blob/master/src/tools/ci/README

Thanks, I'll take a look into it.

Meanwhile v5 is attached with slight changes to try to make cfbot happy:

Thanks for the updated version!

FWIW, I had to do a few changes to get an error free compiling experience with
autoconf/or meson and both with or without the libnuma configure option.

Sharing here as .txt files:

v5-0004-configure-changes.txt: changes in configure + add a test on numa.h
availability and a call to numa_available.

v5-0005-pg_numa.c-changes.txt: moving the <unistd.h> outside of USE_LIBNUMA
because the file is still using sysconf() in the non-NUMA code path. Also,
removed a ";" in "#endif;" in the non-NUMA code path.

v5-0006-meson.build-changes.txt.

Those apply on top of your v5.

Also the pg_buffercache test fails without the libnuma configure option. Maybe
some tests should depend of the libnuma configure option.

Still did not look closely to the code.

Regards,

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

#13

bertranddrouvot.pg@gmail.com

about 1 year ago

In reply to: Jakub Wartak (#9)

Re: Draft for basic NUMA observability

Hi Jakub,

On Wed, Feb 26, 2025 at 09:48:41AM +0100, Jakub Wartak wrote:

On Mon, Feb 24, 2025 at 5:11 PM Bertrand Drouvot
<bertranddrouvot.pg@gmail.com> wrote:

Hi,

On Mon, Feb 24, 2025 at 09:06:20AM -0500, Andres Freund wrote:

Does the issue with "new" backends seeing pages as not present exist both with
and without huge pages?

That's a good point and from what I can see it's correct with huge pages being
used (it means all processes see the same NUMA node assignment regardless of
access patterns).

Hi Bertrand , please see that nearby thread. I've got quite the
opposite results. I need page-fault memory or I get invalid results
("-2"). What kernel version are you using ? (I've tried it on two
6.10.x series kernels , virtualized in both cases; one was EPYC [real
NUMA, but not VM so not real hardware])

Thanks for sharing your numbers!

It looks like that with hp enabled then the shared_buffers plays a role.

1. With hp, shared_buffers 4GB:

huge_pages_status
-------------------
on
(1 row)

shared_buffers
----------------
4GB
(1 row)

NOTICE: os_page_count=2048 os_page_size=2097152 pages_per_blk=0.003906
numa_zone_id | count
--------------+--------
| 507618
0 | 1054
-2 | 15616
(3 rows)

2. With hp, shared_buffers 23GB:

huge_pages_status
-------------------
on
(1 row)

shared_buffers
----------------
23GB
(1 row)

NOTICE: os_page_count=11776 os_page_size=2097152 pages_per_blk=0.003906
numa_zone_id | count
--------------+---------
| 2997974
0 | 16682
(2 rows)

3. no hp, shared_buffers 23GB:

huge_pages_status
-------------------
off
(1 row)

shared_buffers
----------------
23GB
(1 row)

ERROR: extension "pg_buffercache" already exists
NOTICE: os_page_count=6029312 os_page_size=4096 pages_per_blk=2.000000
numa_zone_id | count
--------------+---------
| 2997975
-2 | 16482
1 | 199
(3 rows)

Maybe the kernel is taking some decisions based on the HugePages_Rsvd, I've
no ideas. Anyway there is little than we can do and the "touchpages" patch
seems to provide "accurate" results.

Regards,

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

#14

jakub.wartak@enterprisedb.com

about 1 year ago

In reply to: Bertrand Drouvot (#12)

Re: Draft for basic NUMA observability

On Wed, Feb 26, 2025 at 6:13 PM Bertrand Drouvot
<bertranddrouvot.pg@gmail.com> wrote:
[..]

Meanwhile v5 is attached with slight changes to try to make cfbot happy:

Thanks for the updated version!

FWIW, I had to do a few changes to get an error free compiling experience with
autoconf/or meson and both with or without the libnuma configure option.

Sharing here as .txt files:

Also the pg_buffercache test fails without the libnuma configure option. Maybe
some tests should depend of the libnuma configure option.

[..]

Thank you so much for this Bertrand !

I've applied those , played a little bit with configure and meson and
reproduced the test error and fixed it by silencing that NOTICE in
tests. So v6 is attached even before I get a chance to start using
that CI. Still waiting for some input and tests regarding that earlier
touchpages attempt, docs are still missing...

-J.

#15

bertranddrouvot.pg@gmail.com

about 1 year ago

In reply to: Jakub Wartak (#14)

Re: Draft for basic NUMA observability

Hi,

On Thu, Feb 27, 2025 at 10:05:46AM +0100, Jakub Wartak wrote:

On Wed, Feb 26, 2025 at 6:13 PM Bertrand Drouvot
<bertranddrouvot.pg@gmail.com> wrote:
[..]

Meanwhile v5 is attached with slight changes to try to make cfbot happy:

Thanks for the updated version!

FWIW, I had to do a few changes to get an error free compiling experience with
autoconf/or meson and both with or without the libnuma configure option.

Sharing here as .txt files:

Also the pg_buffercache test fails without the libnuma configure option. Maybe
some tests should depend of the libnuma configure option.

[..]

Thank you so much for this Bertrand !

I've applied those , played a little bit with configure and meson and
reproduced the test error and fixed it by silencing that NOTICE in
tests. So v6 is attached even before I get a chance to start using
that CI. Still waiting for some input and tests regarding that earlier
touchpages attempt, docs are still missing...

Thanks for the new version!

I did some tests and it looks like it's giving correct results. I don't see -2
anymore and every backend reports correct distribution (with or without hp,
with "small" or "large" shared buffer).

A few random comments:

=== 1

+               /*
+                * This is for gathering some NUMA statistics. We might be using
+                * various DB block sizes (4kB, 8kB , .. 32kB) that end up being
+                * allocated in various different OS memory pages sizes, so first we
+                * need to understand the OS memory page size before calling
+                * move_pages()
+                */
+               os_page_size = pg_numa_get_pagesize();
+               os_page_count = ((uint64)NBuffers * BLCKSZ) / os_page_size;
+               pages_per_blk = (float) BLCKSZ / os_page_size;
+
+               elog(DEBUG1, "NUMA os_page_count=%d os_page_size=%ld pages_per_blk=%f",
+                        os_page_count, os_page_size, pages_per_blk);
+
+               os_page_ptrs = palloc(sizeof(void *) * os_page_count);
+               os_pages_status = palloc(sizeof(int) * os_page_count);
+               memset(os_page_ptrs, 0, sizeof(void *) * os_page_count);
+
+               /*
+                * If we ever get 0xff back from kernel inquiry, then we probably have
+                * bug in our buffers to OS page mapping code here
+                */
+               memset(os_pages_status, 0xff, sizeof(int) * os_page_count);

I think that if (query_numa) check should also wrap that entire section of code.

=== 2

+                       if (query_numa)
+                       { 
+                          blk2page = (int) i * pages_per_blk;
+                          j = 0;
+                          do
+                          {

This check is done for every page. I wonder if it would not make sense
to create a brand new function for pg_buffercache_numa and just let the
current pg_buffercache_pages() as it is. That said it would be great to avoid
code duplication as much a possible though, maybe using a shared
populate_buffercache_entry() or such helper function?

=== 3

+#define ONE_GIGABYTE 1024*1024*1024
+                                               if ((i * os_page_size) % ONE_GIGABYTE == 0)
+                                                       CHECK_FOR_INTERRUPTS();
+                                       }

Did you observe noticable performance impact if calling CHECK_FOR_INTERRUPTS()
for every page instead? (I don't see with a 30GB shared buffer). I've the
feeling that we could get rid of the "ONE_GIGABYTE" check.

=== 4

+ pfree(os_page_ptrs);
+ pfree(os_pages_status);

Not sure that's needed, we should be in a short-lived memory context here
(ExprContext or such).

=== 5

+/* SQL SRF showing NUMA zones for allocated shared memory */
+Datum
+pg_get_shmem_numa_allocations(PG_FUNCTION_ARGS)
+{

That's a good idea.

+               for (i = 0; i < shm_ent_page_count; i++)
+               {
+                       /*
+                        * In order to get reliable results we also need to touch memory
+                        * pages so that inquiry about NUMA zone doesn't return -2.
+                        */
+                       volatile uint64 touch pg_attribute_unused();
+
+                       page_ptrs[i] = (char *) ent->location + (i * os_page_size);
+                       pg_numa_touch_mem_if_required(touch, page_ptrs[i]);

That sounds right.

Could we also avoid some code duplication with pg_get_shmem_allocations()?

Also same remarks about pfree() and ONE_GIGABYTE as above.

A few other things:

==== 6

+++ b/src/backend/storage/ipc/shmem.c
@@ -73,6 +73,7 @@
 #include "storage/shmem.h"
 #include "storage/spin.h"
 #include "utils/builtins.h"
+#include "port/pg_numa.h"

Not at the right position, should be between those 2:

#include "miscadmin.h"
#include "storage/lwlock.h"

==== 7

+/*-------------------------------------------------------------------------
+ *
+ * pg_numa.h
+ *       Miscellaneous functions for bit-wise operations.

=== 8

+++ b/src/port/pg_numa.c
@@ -0,0 +1,150 @@
+/*-------------------------------------------------------------------------
+ *
+ * numa.c
+ *             Basic NUMA portability routines

s/numa.c/pg_numa.c/ ?

=== 9

+++ b/contrib/pg_buffercache/pg_buffercache_pages.c
@@ -6,6 +6,7 @@
  *       contrib/pg_buffercache/pg_buffercache_pages.c
  *-------------------------------------------------------------------------
  */
+#include "pg_config.h"
 #include "postgres.h"

Is this new include needed?

#include "access/htup_details.h"
@@ -13,10 +14,12 @@
#include "funcapi.h"
#include "storage/buf_internals.h"
#include "storage/bufmgr.h"
+#include "port/pg_numa.h"
+#include "storage/pg_shmem.h"

not in the right order.

Regards,

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

#16

bertranddrouvot.pg@gmail.com

about 1 year ago

In reply to: Jakub Wartak (#8)

Re: Draft for basic NUMA observability

Hi,

On Wed, Feb 26, 2025 at 09:38:20AM +0100, Jakub Wartak wrote:

On Mon, Feb 24, 2025 at 3:06 PM Andres Freund <andres@anarazel.de> wrote:

Why is this done before we even have gotten -2 back? Even if we need it, it
seems like we ought to defer this until necessary.

Not fixed yet: maybe we could even do a `static` with
`has_this_run_earlier` and just perform this work only once during the
first time?

Not sure I get your idea, could you share what the code would look like?

Regards,

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

#17

jakub.wartak@enterprisedb.com

about 1 year ago

In reply to: Bertrand Drouvot (#15)

Re: Draft for basic NUMA observability

Hi!

On Thu, Feb 27, 2025 at 4:34 PM Bertrand Drouvot
<bertranddrouvot.pg@gmail.com> wrote:

I did some tests and it looks like it's giving correct results. I don't see -2
anymore and every backend reports correct distribution (with or without hp,
with "small" or "large" shared buffer).

Cool! Attached is v7 that is fully green on cirrus CI that Andres
recommended, we will see how cfbot reacts to this. BTW docs are still
missing. When started with proper interleave=all with s_b=64GB,
hugepages and 4 NUMA nodes (1socket with 4 CCDs) after small pgbench:

postgres=# select buffers_used, buffers_unused from pg_buffercache_summary();
buffers_used | buffers_unused
--------------+----------------
170853 | 8217755
(
ostgres=# select numa_zone_id, count(*) from pg_buffercache_numa group
by numa_zone_id order by numa_zone_id;
DEBUG: NUMA: os_page_count=32768 os_page_size=2097152 pages_per_blk=0.003906
numa_zone_id | count
--------------+---------
0 | 42752
1 | 42752
2 | 42752
3 | 42597
| 8217755
Time: 5828.100 ms (00:05.828)
postgres=# select 3*42752+42597;
?column?
----------
170853

postgres=# select * from pg_shmem_numa_allocations order by numa_size
desc limit 12;
DEBUG: NUMA: page-faulting shared memory segments for proper NUMA readouts
name | numa_zone_id | numa_size
--------------------+--------------+-------------
Buffer Blocks | 0 | 17179869184
Buffer Blocks | 1 | 17179869184
Buffer Blocks | 3 | 17179869184
Buffer Blocks | 2 | 17179869184
Buffer Descriptors | 2 | 134217728
Buffer Descriptors | 1 | 134217728
Buffer Descriptors | 0 | 134217728
Buffer Descriptors | 3 | 134217728
Checkpointer Data | 1 | 67108864
Checkpointer Data | 0 | 67108864
Checkpointer Data | 2 | 67108864
Checkpointer Data | 3 | 67108864

Time: 68.579 ms

A few random comments:

=== 1

[..]

I think that the if (query_numa) check should also wrap that entire section of code.

Done.

=== 2
+                       if (query_numa)
+                       {
+                          blk2page = (int) i * pages_per_blk;
+                          j = 0;
+                          do
+                          {
This check is done for every page. I wonder if it would not make sense
to create a brand new function for pg_buffercache_numa and just let the
current pg_buffercache_pages() as it is. That said it would be great to avoid
code duplication as much a possible though, maybe using a shared
populate_buffercache_entry() or such helper function?

Well, I've made query_numa a parameter there simply to avoid that code
duplication in the first place, look at those TupleDescInitEntry()...
IMHO rarely anybody uses pg_buffercache, but we could add unlikely()
there maybe to hint compiler with some smaller routine to reduce
complexity of that main routine? (assuming NUMA inquiry is going to be
rare).

=== 3
+#define ONE_GIGABYTE 1024*1024*1024
+                                               if ((i * os_page_size) % ONE_GIGABYTE == 0)
+                                                       CHECK_FOR_INTERRUPTS();
+                                       }
Did you observe noticable performance impact if calling CHECK_FOR_INTERRUPTS()
for every page instead? (I don't see with a 30GB shared buffer). I've the
feeling that we could get rid of the "ONE_GIGABYTE" check.

You are right and no it was simply my premature optimization attempt,
as apparently CFI on closer looks seems to be already having
unlikely() and looks really cheap, so yeah I've removed that.

=== 4

+ pfree(os_page_ptrs);
+ pfree(os_pages_status);

Not sure that's needed, we should be in a short-lived memory context here
(ExprContext or such).

Yes, I wanted to have it just for illustrative and stylish purposes,
but right, removed.

=== 5

+/* SQL SRF showing NUMA zones for allocated shared memory */
+Datum
+pg_get_shmem_numa_allocations(PG_FUNCTION_ARGS)
+{

[..]

+               for (i = 0; i < shm_ent_page_count; i++)
+               {
+                       /*
+                        * In order to get reliable results we also need to touch memory
+                        * pages so that inquiry about NUMA zone doesn't return -2.
+                        */
+                       volatile uint64 touch pg_attribute_unused();
+
+                       page_ptrs[i] = (char *) ent->location + (i * os_page_size);
+                       pg_numa_touch_mem_if_required(touch, page_ptrs[i]);

That sounds right.

Could we also avoid some code duplication with pg_get_shmem_allocations()?

Not sure I understand do you want to avoid code duplication
pg_get_shmem_allocations() vs pg_get_shmem_numa_allocations() or
pg_get_shmem_numa_allocations() vs pg_buffercache_pages(query_numa =
true) ?

Also same remarks about pfree() and ONE_GIGABYTE as above.

Fixed.

A few other things:

==== 6

+#include "port/pg_numa.h"
Not at the right position, should be between those 2:

#include "miscadmin.h"
#include "storage/lwlock.h"

Fixed.

==== 7
+/*-------------------------------------------------------------------------
+ *
+ * pg_numa.h
+ *       Miscellaneous functions for bit-wise operations.
description is not correct. Also the "Copyright (c) 2019-2025" might be
"Copyright (c) 2025" instead.

Fixed.

=== 8

+++ b/src/port/pg_numa.c
@@ -0,0 +1,150 @@
+/*-------------------------------------------------------------------------
+ *
+ * numa.c
+ *             Basic NUMA portability routines

s/numa.c/pg_numa.c/ ?

Fixed.

=== 9

+++ b/contrib/pg_buffercache/pg_buffercache_pages.c
@@ -6,6 +6,7 @@
*       contrib/pg_buffercache/pg_buffercache_pages.c
*-------------------------------------------------------------------------
*/
+#include "pg_config.h"
#include "postgres.h"

Is this new include needed?

Removed, don't remember how it arrived here, most have been some
artifact of earlier attempts.

#include "access/htup_details.h"
@@ -13,10 +14,12 @@
#include "funcapi.h"
#include "storage/buf_internals.h"
#include "storage/bufmgr.h"
+#include "port/pg_numa.h"
+#include "storage/pg_shmem.h"

not in the right order.

Fixed.

And also those from nearby message:

On Thu, Feb 27, 2025 at 4:42 PM Bertrand Drouvot
<bertranddrouvot.pg@gmail.com> wrote:

On Wed, Feb 26, 2025 at 09:38:20AM +0100, Jakub Wartak wrote:

On Mon, Feb 24, 2025 at 3:06 PM Andres Freund <andres@anarazel.de> wrote:

Why is this done before we even have gotten -2 back? Even if we need it, it
seems like we ought to defer this until necessary.

Not fixed yet: maybe we could even do a `static` with
`has_this_run_earlier` and just perform this work only once during the
first time?

Not sure I get your idea, could you share what the code would look like?

Please see pg_buffercache_pages I've just added static bool firstUseInBackend:

postgres@postgres:1234 : 25103 # select numa_zone_id, count(*) from
pg_buffercache_numa group by numa_zone_id;
DEBUG: NUMA: os_page_count=32768 os_page_size=4096 pages_per_blk=2.000000
DEBUG: NUMA: page-faulting the buffercache for proper NUMA readouts
[..]
postgres@postgres:1234 : 25103 # select numa_zone_id, count(*) from
pg_buffercache_numa group by numa_zone_id;
DEBUG: NUMA: os_page_count=32768 os_page_size=4096 pages_per_blk=2.000000
numa_zone_id | count
[..]

Same was done to the pg_get_shmem_numa_allocations.

Also CFbot/cirrus was getting:

[11:01:05.027] In file included from ../../src/include/postgres.h:49,
[11:01:05.027] from pg_buffercache_pages.c:10:
[11:01:05.027] pg_buffercache_pages.c: In function ‘pg_buffercache_pages’:
[11:01:05.027] pg_buffercache_pages.c:194:30: error: format ‘%ld’ expects argument of type ‘long int’, but argument 3 has type ‘Size’ {aka ‘long long unsigned int’} [-Werror=format=]

Fixed with %zu (for size_t) instead of %ld.

Linux - Debian Bookworm - Autoconf got:
[10:42:59.216] checking numa.h usability... no
[10:42:59.268] checking numa.h presence... no
[10:42:59.286] checking for numa.h... no
[10:42:59.286] configure: error: header file <numa.h> is required for --with-libnuma

I've added libnuma1 to cirrus in a similar vein like libcurl to avoid this.

[13:50:47.449] gcc -m32 @src/backend/postgres.rsp
[13:50:47.449] /usr/bin/ld: /usr/lib/x86_64-linux-gnu/libnuma.so: error adding symbols: file in wrong format

I've also got an error in 32-bit build as libnuma.h is there, but
apparently libnuma provides only x86_64. Anyway 32-bit(even with PAE)
and NUMA doesn't seem to make sense so I've added -Dlibnuma=off for
such build.

-J.

#18

bertranddrouvot.pg@gmail.com

about 1 year ago

In reply to: Jakub Wartak (#17)

Re: Draft for basic NUMA observability

Hi,

On Tue, Mar 04, 2025 at 11:48:31AM +0100, Jakub Wartak wrote:

Hi!

On Thu, Feb 27, 2025 at 4:34 PM Bertrand Drouvot
<bertranddrouvot.pg@gmail.com> wrote:

I did some tests and it looks like it's giving correct results. I don't see -2
anymore and every backend reports correct distribution (with or without hp,
with "small" or "large" shared buffer).

Cool! Attached is v7

Thanks for the new version!

=== 2
+                       if (query_numa)
+                       {
+                          blk2page = (int) i * pages_per_blk;
+                          j = 0;
+                          do
+                          {
This check is done for every page. I wonder if it would not make sense
to create a brand new function for pg_buffercache_numa and just let the
current pg_buffercache_pages() as it is. That said it would be great to avoid
code duplication as much a possible though, maybe using a shared
populate_buffercache_entry() or such helper function?
Well, I've made query_numa a parameter there simply to avoid that code
duplication in the first place, look at those TupleDescInitEntry()...

Yeah, that's why I was mentioning to use a "shared" populate_buffercache_entry()
or such function: to put the "duplicated" code in it and then use this
shared function in pg_buffercache_pages() and in the new numa related one.

IMHO rarely anybody uses pg_buffercache, but we could add unlikely()

I think unlikely() should be used for optimization based on code path likelihood,
not based on how often users might use a feature.

=== 5

Could we also avoid some code duplication with pg_get_shmem_allocations()?

Not sure I understand do you want to avoid code duplication
pg_get_shmem_allocations() vs pg_get_shmem_numa_allocations() or
pg_get_shmem_numa_allocations() vs pg_buffercache_pages(query_numa =
true) ?

I meant to say avoid code duplication between pg_get_shmem_allocations() and
pg_get_shmem_numa_allocations(). It might be possible to create a shared
function for them too. That said, it looks like that the savings (if any), would
not be that much, so maybe just forget about it.

Regards,

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

#19

jakub.wartak@enterprisedb.com

about 1 year ago

In reply to: Bertrand Drouvot (#18)

Re: Draft for basic NUMA observability

On Tue, Mar 4, 2025 at 5:02 PM Bertrand Drouvot
<bertranddrouvot.pg@gmail.com> wrote:

Cool! Attached is v7

Thanks for the new version!

... and another one: 7b ;)

=== 2

[..]

Well, I've made query_numa a parameter there simply to avoid that code
duplication in the first place, look at those TupleDescInitEntry()...

Yeah, that's why I was mentioning to use a "shared" populate_buffercache_entry()
or such function: to put the "duplicated" code in it and then use this
shared function in pg_buffercache_pages() and in the new numa related one.

OK, so hastily attempted that in 7b , I had to do a larger refactor
there to avoid code duplication between those two. I don't know which
attempt is better though (7 vs 7b)..

IMHO rarely anybody uses pg_buffercache, but we could add unlikely()

I think unlikely() should be used for optimization based on code path likelihood,
not based on how often users might use a feature.

In 7b I've removed the unlikely() - For a moment I was thinking that
you are concerned about this loop for NBuffers to be as much optimized
as it can and that's the reason for splitting the routines.

=== 5

[..]

I meant to say avoid code duplication between pg_get_shmem_allocations() and
pg_get_shmem_numa_allocations(). It might be possible to create a shared
function for them too. That said, it looks like that the savings (if any), would
not be that much, so maybe just forget about it.

Yeah, OK, so let's leave it at that.

-J.

#20

jakub.wartak@enterprisedb.com

about 1 year ago

In reply to: Jakub Wartak (#19)

Re: Draft for basic NUMA observability

Hi,
On Wed, Mar 5, 2025 at 10:30 AM Jakub Wartak
<jakub.wartak@enterprisedb.com> wrote:

Hi,

Yeah, that's why I was mentioning to use a "shared" populate_buffercache_entry()
or such function: to put the "duplicated" code in it and then use this
shared function in pg_buffercache_pages() and in the new numa related one.

OK, so hastily attempted that in 7b , I had to do a larger refactor
there to avoid code duplication between those two. I don't know which
attempt is better though (7 vs 7b)..

I'm attaching basically the earlier stuff (v7b) as v8 with the
following minor changes:
- docs are included
- changed int8 to int4 in one function definition for numa_zone_id

-J.

#21

jakub.wartak@enterprisedb.com

about 1 year ago

In reply to: Jakub Wartak (#20)

#22

bertranddrouvot.pg@gmail.com

about 1 year ago

In reply to: Jakub Wartak (#21)

#23

jakub.wartak@enterprisedb.com

about 1 year ago

In reply to: Bertrand Drouvot (#22)

#24

jakub.wartak@enterprisedb.com

about 1 year ago

In reply to: Jakub Wartak (#23)

#25

bertranddrouvot.pg@gmail.com

about 1 year ago

In reply to: Jakub Wartak (#23)

#26

bertranddrouvot.pg@gmail.com

about 1 year ago

In reply to: Bertrand Drouvot (#25)

#27

jakub.wartak@enterprisedb.com

about 1 year ago

In reply to: Bertrand Drouvot (#26)

#28

bertranddrouvot.pg@gmail.com

about 1 year ago

In reply to: Jakub Wartak (#27)

#29

jakub.wartak@enterprisedb.com

about 1 year ago

In reply to: Bertrand Drouvot (#28)

#30

bertranddrouvot.pg@gmail.com

about 1 year ago

In reply to: Jakub Wartak (#29)

#31

jakub.wartak@enterprisedb.com

about 1 year ago

In reply to: Bertrand Drouvot (#30)

#32

bertranddrouvot.pg@gmail.com

about 1 year ago

In reply to: Jakub Wartak (#31)

#33

jakub.wartak@enterprisedb.com

about 1 year ago

In reply to: Bertrand Drouvot (#32)

#34

Nazir Bilal Yavuz

byavuz81@gmail.com

about 1 year ago

In reply to: Jakub Wartak (#33)

#35

jakub.wartak@enterprisedb.com

about 1 year ago

In reply to: Nazir Bilal Yavuz (#34)

#36

Alvaro Herrera

alvherre@2ndquadrant.com

about 1 year ago

In reply to: Jakub Wartak (#35)

#37

andres@anarazel.de

about 1 year ago

In reply to: Jakub Wartak (#35)

#38

jakub.wartak@enterprisedb.com

about 1 year ago

In reply to: Alvaro Herrera (#36)

#39

jakub.wartak@enterprisedb.com

about 1 year ago

In reply to: Andres Freund (#37)

#40

bertranddrouvot.pg@gmail.com

about 1 year ago

In reply to: Jakub Wartak (#39)

#41

jakub.wartak@enterprisedb.com

about 1 year ago

In reply to: Bertrand Drouvot (#40)

#42

bertranddrouvot.pg@gmail.com

about 1 year ago

In reply to: Jakub Wartak (#41)

#43

tomas.vondra@2ndquadrant.com

about 1 year ago

In reply to: Jakub Wartak (#41)

#44

jakub.wartak@enterprisedb.com

about 1 year ago

In reply to: Bertrand Drouvot (#42)

#45

jakub.wartak@enterprisedb.com

about 1 year ago

In reply to: Tomas Vondra (#43)

#46

bertranddrouvot.pg@gmail.com

about 1 year ago

In reply to: Jakub Wartak (#44)

#47

tomas.vondra@2ndquadrant.com

about 1 year ago

In reply to: Jakub Wartak (#45)

#48

jakub.wartak@enterprisedb.com

about 1 year ago

In reply to: Bertrand Drouvot (#46)

#49

jakub.wartak@enterprisedb.com

about 1 year ago

In reply to: Tomas Vondra (#47)

#50

bertranddrouvot.pg@gmail.com

about 1 year ago

In reply to: Jakub Wartak (#49)

#51

tomas.vondra@2ndquadrant.com

about 1 year ago

In reply to: Jakub Wartak (#49)

#52

tomas.vondra@2ndquadrant.com

about 1 year ago

In reply to: Bertrand Drouvot (#50)

#53

jakub.wartak@enterprisedb.com

about 1 year ago

In reply to: Bertrand Drouvot (#50)

#54

bertranddrouvot.pg@gmail.com

about 1 year ago

In reply to: Jakub Wartak (#53)

#55

jakub.wartak@enterprisedb.com

about 1 year ago

In reply to: Tomas Vondra (#52)

#56

jakub.wartak@enterprisedb.com

about 1 year ago

In reply to: Tomas Vondra (#51)

#57

tomas.vondra@2ndquadrant.com

about 1 year ago

In reply to: Jakub Wartak (#56)

#58

bertranddrouvot.pg@gmail.com

about 1 year ago

In reply to: Tomas Vondra (#57)

#59

jakub.wartak@enterprisedb.com

about 1 year ago

In reply to: Bertrand Drouvot (#58)

#60

tomas.vondra@2ndquadrant.com

about 1 year ago

In reply to: Jakub Wartak (#59)

#61

tomas.vondra@2ndquadrant.com

about 1 year ago

In reply to: Bertrand Drouvot (#58)

#62

jakub.wartak@enterprisedb.com

about 1 year ago

In reply to: Tomas Vondra (#60)

#63

tomas.vondra@2ndquadrant.com

about 1 year ago

In reply to: Jakub Wartak (#62)

#64

bertranddrouvot.pg@gmail.com

about 1 year ago

In reply to: Tomas Vondra (#63)

#65

tomas.vondra@2ndquadrant.com

about 1 year ago

In reply to: Bertrand Drouvot (#64)

#66

tomas.vondra@2ndquadrant.com

about 1 year ago

In reply to: Tomas Vondra (#65)

#67

bertranddrouvot.pg@gmail.com

about 1 year ago

In reply to: Tomas Vondra (#66)

#68

bertranddrouvot.pg@gmail.com

about 1 year ago

In reply to: Tomas Vondra (#65)

#69

andres@anarazel.de

about 1 year ago

In reply to: Tomas Vondra (#66)

#70

andres@anarazel.de

about 1 year ago

In reply to: Andres Freund (#69)

#71

tomas.vondra@2ndquadrant.com

about 1 year ago

In reply to: Andres Freund (#69)

#72

tomas.vondra@2ndquadrant.com

about 1 year ago

In reply to: Andres Freund (#70)

#73

jakub.wartak@enterprisedb.com

about 1 year ago

In reply to: Andres Freund (#70)

#74

tomas.vondra@2ndquadrant.com

about 1 year ago

In reply to: Jakub Wartak (#73)

#75

jakub.wartak@enterprisedb.com

about 1 year ago

In reply to: Tomas Vondra (#74)

#76

bertranddrouvot.pg@gmail.com

about 1 year ago

In reply to: Jakub Wartak (#75)

#77

jakub.wartak@enterprisedb.com

about 1 year ago

In reply to: Bertrand Drouvot (#76)

#78

tomas.vondra@2ndquadrant.com

about 1 year ago

In reply to: Jakub Wartak (#77)

#79

andres@anarazel.de

about 1 year ago

In reply to: Tomas Vondra (#72)

#80

andres@anarazel.de

about 1 year ago

In reply to: Tomas Vondra (#71)

#81

tomas.vondra@2ndquadrant.com

about 1 year ago

In reply to: Andres Freund (#79)

#82

andres@anarazel.de

about 1 year ago

In reply to: Tomas Vondra (#81)

#83

tomas.vondra@2ndquadrant.com

about 1 year ago

In reply to: Andres Freund (#82)

#84

andres@anarazel.de

about 1 year ago

In reply to: Jakub Wartak (#62)

#85

tomas.vondra@2ndquadrant.com

about 1 year ago

In reply to: Andres Freund (#84)

#86

andres@anarazel.de

about 1 year ago

In reply to: Tomas Vondra (#85)

#87

bertranddrouvot.pg@gmail.com

about 1 year ago

In reply to: Andres Freund (#82)

#88

tomas.vondra@2ndquadrant.com

about 1 year ago

In reply to: Bertrand Drouvot (#87)

#89

jakub.wartak@enterprisedb.com

about 1 year ago

In reply to: Tomas Vondra (#88)

#90

tomas.vondra@2ndquadrant.com

about 1 year ago

In reply to: Jakub Wartak (#89)

#91

jakub.wartak@enterprisedb.com

about 1 year ago

In reply to: Tomas Vondra (#90)

#92

Shinoda, Noriyoshi (PN Japan FSIP)

tomas.vondra@2ndquadrant.com

about 1 year ago

In reply to: Jakub Wartak (#91)

#93

noriyoshi.shinoda@hpe.com

about 1 year ago

In reply to: Tomas Vondra (#92)

#94

tomas.vondra@2ndquadrant.com

about 1 year ago

In reply to: Shinoda, Noriyoshi (PN Japan FSIP) (#93)

#95

Kirill Reshke

reshkekirill@gmail.com

about 1 year ago

In reply to: Tomas Vondra (#85)

#96

andres@anarazel.de

about 1 year ago

In reply to: Kirill Reshke (#95)

#97

tomas.vondra@2ndquadrant.com

about 1 year ago

In reply to: Andres Freund (#96)

#98

andres@anarazel.de

about 1 year ago

In reply to: Tomas Vondra (#97)

#99

andres@anarazel.de

about 1 year ago

In reply to: Andres Freund (#98)

#100

tomas.vondra@2ndquadrant.com

about 1 year ago

In reply to: Andres Freund (#99)

#101

tomas.vondra@2ndquadrant.com

about 1 year ago

In reply to: Andres Freund (#96)

#102

andres@anarazel.de

about 1 year ago

In reply to: Tomas Vondra (#100)

#103

andres@anarazel.de

about 1 year ago

In reply to: Tomas Vondra (#101)

#104

jakub.wartak@enterprisedb.com

about 1 year ago

In reply to: Tomas Vondra (#100)

#105

tomas.vondra@2ndquadrant.com

about 1 year ago

In reply to: Andres Freund (#103)

#106

tomas.vondra@2ndquadrant.com

about 1 year ago

In reply to: Tomas Vondra (#105)

#107

tomas.vondra@2ndquadrant.com

about 1 year ago

In reply to: Tomas Vondra (#106)

#108

andres@anarazel.de

about 1 year ago

In reply to: Tomas Vondra (#107)

#109

tomas.vondra@2ndquadrant.com

about 1 year ago

In reply to: Andres Freund (#108)

#110

andres@anarazel.de

about 1 year ago

In reply to: Tomas Vondra (#109)

#111

tomas.vondra@2ndquadrant.com

about 1 year ago

In reply to: Andres Freund (#110)

#112

bertranddrouvot.pg@gmail.com

about 1 year ago

In reply to: Tomas Vondra (#94)

#113

Patrick Stählin

me@packi.ch

9 months ago

In reply to: Tomas Vondra (#90)

#114